Abstract


An approach to software development for parallel processing systems is developed,
based on the parallel object-oriented functional computational model (PROOF), which
incorporates the functional paradigm into the object-oriented paradigm. Our approach
separates architecture-dependent issues from software development and consequently
enhances the portability of the software. It facilitates software development for
any parallel processing system by freeing the programmer from considering the various
parallelization aspects of the software and the network topology of the parallel
processing system. It allows the exploitation of parallelism at two levels of
granularity, the object level and the method level, which makes it suitable for
developing software for any MIMD machine. Furthermore, the approach retains the
benefits of both the object-oriented and the functional paradigms, such as
modifiability, understandability, portability, and parallelizability. A framework
for software development for parallel processing systems has been established,
consisting of object-oriented analysis, object design, coding, and transformation
phases. Software is developed in PROOF/L, a high-level prototype parallel language
based on the computation model PROOF, and the PROOF/L code is then transformed into
a target language of the MIMD machine. The transformation is performed via a two-step
translation: one from the code written in PROOF/L to an intermediate form, and the
other from the intermediate form to code in the target language. An example is given
to illustrate this approach.


Keywords- Parallel processing systems, MIMD machine, software development
framework, object-oriented analysis, object design, verification, partitioning,
grain size determination, PROOF, PROOF/L, transputers.
Chapter 1


Introduction 


The speed of computers has greatly increased during the past decade, especially with
the recent rapid development of various commercially available parallel processing
systems. Such a vast increase in speed should satisfy the need for high performance
computing in applications such as C3I (command, control, communication, and
intelligence) systems, space exploration missions, weather prediction, and
telecommunication systems, where there are many interacting components, shared
resources, and computationally intensive tasks. However, the potential of such high
performance computing systems cannot be realized without effective software
development methods for them. Unfortunately, such methods are far from mature because
of the additional complexity of concurrency and synchronization. The lack of such
methods is a major obstacle to the use of parallel processing systems in various
application areas. The goal of this project is to develop an effective software
development approach for parallel processing systems.

In this project we have established a framework for the development of software for
parallel processing systems based on the parallel object-oriented functional
computational model (PROOF) [1, 2], which incorporates the functional paradigm into
the object-oriented paradigm. The main advantage of this approach is that it separates
architecture-dependent issues from software development. Hence, the programmer does
not need to be concerned with issues such as synchronization, parallelization, or the
topology of the parallel processing system, which makes software development
independent of the architecture of the parallel processing system [2]-[4]. Software
developed using this approach is easily portable over a variety of parallel computer
architectures. The approach allows the exploitation of parallelism at various levels
of granularity without sacrificing the effectiveness of the object-oriented paradigm.
Parallelism is exploited at both the object level (coarse grain) and the method level
(fine grain), and hence our approach is suitable for MIMD machines [5, 6]. Software
developed based on this paradigm reflects the parallel structure of the problem space,
which makes the software more understandable and modifiable.

Our approach to software development for parallel processing systems is to use
a parallel programming language. We have designed and implemented a high-level
prototype parallel programming language, PROOF/L, based on PROOF to demonstrate
the feasibility of our approach. Software developed using PROOF/L is transformed
into a suitable target language supported by the parallel processing system. This
transformation is done in two steps. The first step translates PROOF/L code into an
intermediate program representation (IPR). This step, known as the PROOF/L front-end
translation, makes the implicit parallelism in PROOF/L code explicit. The second step
translates the IPR into the target language. This step is known as the PROOF/L
back-end translation.

In this report, we briefly summarize various existing approaches to software
development for parallel processing systems and provide the necessary background
information on PROOF in Chapter 2. Our overall approach to software development
for such systems is elucidated in Chapter 3. An approach to partitioning the software
system into a set of clusters so that coarse grain parallelism can be exploited is
given in Chapter 4. In Chapters 5 through 7, we present the translation from PROOF/L
code to the target code. The PROOF/L front-end translation is described in Chapter 5.
The PROOF/L back-end translation requires additional information regarding the grain
size to produce code which can be executed efficiently on the underlying parallel
processing system; in Chapter 6, we present an approach to grain size analysis on
various patterns of parallelism. In Chapter 7 we describe the translation process
from the IPR to the target language of the parallel processing system and discuss the
implementation issues involved in such a translation. In Chapter 8 we give an example
to illustrate this approach. Finally, the conclusions and future directions of this
research are given in Chapter 9.
Chapter 2


Background 


2.1  Overview  of  Parallel  Programming  Approaches 


Approaches to programming parallel processing systems can be classified into three
categories. The first category is to write programs in conventional sequential
programming languages, such as Fortran [7, 8, 9] or C [7, 10], and then parallelize
the programs using parallelizing or vectorizing compilers. Although this type of
approach seems attractive, since much existing sequential software can be adapted to
such a parallel programming environment with minor modifications, it is hardly
effective. The reasons are that parallelizing or vectorizing compilers cannot unravel
most of the parallelism, can only detect parallelism associated with iterations over
common data structures, such as arrays and matrices [11], and require extensive
dependency analysis [12]. Furthermore, because sequential languages are extended with
compiler directives to help the compilers detect parallelism, and because these
extensions are machine-specific, the portability of programs is hampered. The problems
with this type of approach lie not in the compilers themselves but in the inherently
sequential character of imperative programming languages, which are designed for
sequential execution on sequential processors. Although this type of approach is very
popular in scientific application areas, it is not promising for various applications
of parallel processing systems.

The second category of approaches is to use parallel language constructs to explicitly
model the parallelism in programs. These parallel language constructs include the
parallel statements and input/output commands in CSP [13], monitors and wait/signal
operations in Concurrent Pascal [14], and task and rendezvous mechanisms in Ada
[15]. Although the imperative languages in this category are extended with some
language constructs, the basic model of computation is still sequential. Using the
parallel language constructs does not reduce the programmer's responsibility to
explicitly express the parallelism and ensure correct communication and
synchronization among parallel units, which are extremely complex tasks. In addition,
these parallel language constructs are only suitable for expressing coarse grain
parallelism. Thus, massive and fine grain parallelism cannot be expressed in this type
of approach.

The third category of approaches is to use parallel programming languages, such as
Id Nouveau [16] and SISAL [17], which are functional languages tailored for scientific
computation, PARLOG [18], a parallel logic language, and Act 1 [19], an object-based
language based on the Actor model. The underlying computation models of these
parallel programming languages are fundamentally different from the underlying
models of imperative programming languages in that parallelism is mostly implicit
and massive parallelism can easily be obtained. Hence, the programmer using these
languages is liberated from the complications caused by parallelism. This type of
approach is considered the most promising in parallel programming. However, the
existing approaches based on parallel programming languages have some or all of the
following disadvantages:

•  Software engineering concepts for managing parallelism have not been fully
incorporated in these approaches.

•  Most  of  the  approaches  are  targeted  for  the  shared  memory  processors. 

•  The  concept  of  shared  data  has  not  been  introduced  into  programming  parallel 
processors. 

•  Coding  of  correct  synchronization  and  communication  using  explicit  constructs 
is  still  the  programmer’s  responsibility. 

Our  approach  falls  in  the  third  category  because  it  uses  the  parallel  programming 
language  PROOF/L.  However,  our  approach  will  overcome  the  above  difficulties. 


2.2  PROOF  and  PROOF/L 

Our approach is to use a parallel programming language, PROOF/L, based on the
computation model PROOF, which incorporates the functional paradigm into the
object-oriented paradigm. In this section, for the sake of completeness, we summarize
the important features of PROOF and PROOF/L which are used in our approach. For
more detailed information, the reader is referred to [1].

An object-oriented programming model naturally reveals the parallelism existing in
application problems [20]. Besides the advantages of modifiability, maintainability and
reusability, one significant advantage of the object-oriented model over others is that
the concept of an object can be used at earlier stages of the software development cycle
than the implementation stage. This implies that parallel processing aspects, such as
parallelism and communication among parallel components, can be handled naturally
at an early stage of software development, which makes them easier for the programmer
to manage. However, in the object-oriented model, parallel execution of concurrent
objects is the only source of parallelism, and hence the model alone may not be
sufficient for exploiting fine-grain parallelism.

On the other hand, functional languages based on the functional paradigm are
referentially transparent, and race conditions cannot be introduced. As a result, this
paradigm has great potential for exploiting implicit parallelism because it removes the
side-effects caused by assignment statements. Functional programming has been one of
the main directions in developing new languages that directly address the challenge of
parallel programming, i.e., languages in which parallelism can be easily detected and
exploited. However, because of its history insensitivity, the functional paradigm has
limited expressive power and is not suitable for expressing inherently concurrent
behavior.

PROOF incorporates the functional paradigm into the object-oriented paradigm by
supporting object-oriented features, such as objects, classes and inheritance, and by
defining methods as purely applicative functions. Thus, PROOF allows the exploitation
of massive parallelism without sacrificing the effectiveness of the object-oriented
paradigm. In PROOF, each object is an instance of a class, and can be either passive
or active. A passive object acts like a service agency. It waits passively until one of
its methods is invoked by some other object. A passive object may in turn invoke
methods in other objects. An active object is active initially, and it may remain
active throughout its execution, except for occasional suspensions for the purpose of
synchronization with other objects. A body is attached to each active object. A
class is a template for a set of objects bearing similar behavior, and it is defined as
a generic abstract data type. A class is defined by its interface and definition. The
class interface describes the types of the methods provided by the class. The class
definition consists of the composition of its local data and the definition of each of its
methods. Methods are defined as purely applicative functions or functional forms, i.e.,
higher-order functions. We use a constructor $[x_1, x_2, \ldots, x_n]$ to denote a
sequence of homogeneous or heterogeneous elements. In the case of homogeneous elements,
it denotes a list or an array whose types are $T^*$ and
$T^n$ $(= \underbrace{T \times T \times \cdots \times T}_{n})$ respectively. In the case
of heterogeneous elements, it denotes a Cartesian product whose type is
$\prod_{i=1}^{n} T_i$ $(= T_1 \times T_2 \times \cdots \times T_n)$.

PROOF assumes that there is a set of primitive functions and functional forms from
which other functions and functional forms can be easily constructed. The following
are some of the functions and functional forms in PROOF.

a)  Functional form: $\alpha$ (called apply to all)

Type: $(T_i \to T_j) \to T_i^* \to T_j^*$

$\alpha\, f\, [x_1, \ldots, x_n] = [f(x_1), \ldots, f(x_n)]$

$\alpha$ has two parameters, a function $f$ of type $T_i \to T_j$, and a list of
homogeneous elements of type $T_i$. The function $f$ is applied to each element in the
list and yields a list of elements of type $T_j$.

b)  Functional form: $\beta$ (called distributed apply)

Type: $\prod_{i=1}^{n} (T_i^{(1)} \to T_i^{(2)}) \to \prod_{i=1}^{n} T_i^{(1)} \to \prod_{i=1}^{n} T_i^{(2)}$

$\beta\, [f_1, f_2, \ldots, f_n]\, [x_1, \ldots, x_n] = [f_1(x_1), f_2(x_2), \ldots, f_n(x_n)]$

$\beta$ has two parameters, a list of functions in which each function $f_i$ is of type
$T_i^{(1)} \to T_i^{(2)}$, and a list of heterogeneous elements in which the $i$th
element is of type $T_i^{(1)}$. Each function in the first list is applied to the
corresponding element in the second list. It yields a list in which the $i$th element
is of type $T_i^{(2)}$.

c)  Function $\gamma$ (called filter)

Type: $\mathit{bool}^n \to T^n \to T^*$, $0 < k < n$

$$
\gamma\, [b_1, \ldots, b_n]\, [x_1, \ldots, x_n] =
\begin{cases}
[\,] & \text{if } n = 1 \text{ and } b_1 = \text{False} \\
[x_1] & \text{if } n = 1 \text{ and } b_1 = \text{True} \\
\gamma\, [b_1, \ldots, b_k]\, [x_1, \ldots, x_k] \circ \gamma\, [b_{k+1}, \ldots, b_n]\, [x_{k+1}, \ldots, x_n] & \text{if } n > 1
\end{cases}
$$

Here, $\circ$ denotes the concatenation of two lists. It is written in infix form for
the sake of readability. $\gamma$ has two parameters, a list of booleans and a list of
any elements. This function yields a subsequence of the second list by selecting the
elements whose corresponding elements in the first list are True.
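
As an illustration only, the three functional forms above can be mirrored in ordinary Python. This is a sequential sketch, not PROOF/L syntax: the names `apply_to_all`, `distributed_apply` and `filter_by_flags` are assumptions chosen for readability, and the midpoint split in the filter is just one admissible choice of $k$. The point of the sketch is that every element-wise application is independent of the others, which is what would allow an implementation to evaluate them in parallel.

```python
from typing import Any, Callable, List, Sequence, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def apply_to_all(f: Callable[[A], B], xs: Sequence[A]) -> List[B]:
    """alpha: apply one function to every element of a homogeneous list."""
    return [f(x) for x in xs]


def distributed_apply(fs: Sequence[Callable[[Any], Any]],
                      xs: Sequence[Any]) -> List[Any]:
    """beta: apply the i-th function to the i-th element."""
    if len(fs) != len(xs):
        raise ValueError("beta needs exactly one function per element")
    return [f(x) for f, x in zip(fs, xs)]


def filter_by_flags(bs: Sequence[bool], xs: Sequence[A]) -> List[A]:
    """gamma: keep x_i where b_i is True, following the recursive definition."""
    if len(bs) != len(xs):
        raise ValueError("gamma needs exactly one flag per element")
    n = len(xs)
    if n == 0:
        # Not covered by the definition; added so empty input is harmless.
        return []
    if n == 1:
        return [xs[0]] if bs[0] else []
    k = n // 2  # any split point with 0 < k < n is allowed
    # The two halves do not depend on each other.
    return filter_by_flags(bs[:k], xs[:k]) + filter_by_flags(bs[k:], xs[k:])


squares = apply_to_all(lambda x: x * x, [1, 2, 3])
mixed = distributed_apply([str, abs, len], [7, -2, "ab"])
kept = filter_by_flags([True, False, True], ["a", "b", "c"])
```

Because the recursive filter splits its input into two independent halves, the same structure could be evaluated as a balanced tree of concurrent tasks rather than as a left-to-right scan.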

Both inheritance and genericness are supported in PROOF. Inheritance is used to
define a subclass as a specialization of a superclass. A subclass inherits all the local
data and the methods of its superclass. Additional local data and new methods may
be introduced, and inherited methods may be overridden by new definitions.

In PROOF, synchronization among objects is achieved by attaching an optional
precondition, called a guard, to each of the methods in a class. Each guard is
a predicate. The object which invokes the method is suspended when the attached
guard evaluates to False, and it is resumed when the guard becomes True. The guard
attached to a method is defined so that it depends only on the status of the
local data, and not on the definition of any other method.
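
The guard mechanism can be approximated with a condition variable. The following is a minimal sketch under stated assumptions, not the PROOF runtime: the class `GuardedObject`, its `invoke` method, and the bounded-buffer example are hypothetical, and a method is modeled as a pure function from an argument and the current local data to a result and the new local data. The caller is suspended while the guard is False and resumes once another invocation changes the local data so that the guard becomes True.

```python
import threading
from typing import Any, Callable, Tuple


class GuardedObject:
    """Holds local data; an invocation blocks until its guard holds."""

    def __init__(self, state: Any) -> None:
        self._state = state
        self._cond = threading.Condition()

    @property
    def state(self) -> Any:
        return self._state

    def invoke(self,
               guard: Callable[[Any], bool],
               method: Callable[[Any, Any], Tuple[Any, Any]],
               arg: Any = None) -> Any:
        with self._cond:
            # Suspend the caller while the guard evaluates to False.
            # The guard looks only at the local data.
            self._cond.wait_for(lambda: guard(self._state))
            result, self._state = method(arg, self._state)
            # The local data changed, so other guards may now be True.
            self._cond.notify_all()
            return result


# Hypothetical bounded buffer with capacity 2, stored as an immutable tuple.
def put(item, items):
    return None, items + (item,)


def get(_, items):
    return items[0], items[1:]


def not_full(items):
    return len(items) < 2


def not_empty(items):
    return len(items) > 0


buf = GuardedObject(())
results = []
consumer = threading.Thread(
    target=lambda: results.append(buf.invoke(not_empty, get)))
consumer.start()                 # blocks if the buffer is still empty
buf.invoke(not_full, put, "x")   # makes the consumer's guard True
consumer.join()
```

Whichever thread reaches the object first, the consumer can only proceed after the buffer holds an item, so the outcome does not depend on scheduling.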

A major deficiency of the functional paradigm is its history insensitivity. PROOF
is made history sensitive by making objects persistent and allowing the reception
of values by objects, i.e., the assignment of values to objects. The local data of
objects is persistent, and the reception of a value by an object modifies the local
data of that object. A pseudo-function $\mathcal{R}$, called the reception function,
is introduced to denote the reception of a value by an object:

$\mathcal{R}(O)(e)$

$\mathcal{R}$ is not a function, but can be treated as a function. $\mathcal{R}$ has
two parameters: an object $O$, the recipient, and the expression $e$ to be received
by $O$. Expression $e$ may contain applications of applicative functions only. This
pseudo-function can only appear inside the bodies of active objects, and may not be
nested.

Table 1: A multi-mode locking mechanism.

            R-Lock         W-Lock         M-Lock
R-Lock      compatible     compatible     incompatible
W-Lock      compatible     incompatible   incompatible
M-Lock      incompatible   incompatible   incompatible

The major differences between modification of objects through $\mathcal{R}$ and
traditional assignments are:

•  The evaluation of the expression $e$ in $\mathcal{R}$ can be parallel since $e$
contains only applications of purely applicative functions.

•  No  partial  modification  to  the  object  O  is  allowed.  The  local  data  of  an  object 
can  only  be  modified  as  a  whole  entity,  i.e.,  its  components  cannot  be  modified 
individually. 
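
These two properties can be sketched in Python as follows. This is an illustrative sketch only: the class `ObjectCell`, its `receive` method, and the `Stats` record are hypothetical names, and a thread pool stands in for whatever parallel evaluation a PROOF implementation would use. The expression is built from pure functions whose sub-expressions are evaluated concurrently, and the object then receives the result as one whole value. Making the record immutable means its components cannot be modified individually.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import FrozenInstanceError, dataclass
from typing import Sequence


@dataclass(frozen=True)
class Stats:
    """Immutable local data: fields cannot be assigned one at a time."""
    total: int
    count: int


class ObjectCell:
    """A persistent object whose local data is replaced only as a whole."""

    def __init__(self, value: Stats) -> None:
        self.value = value

    def receive(self, new_value: Stats) -> None:
        # Plays the role of the reception: whole-state replacement.
        self.value = new_value


def total_of(xs: Sequence[int]) -> int:
    return sum(xs)


def count_of(xs: Sequence[int]) -> int:
    return len(xs)


samples = (1, 2, 3)
cell = ObjectCell(Stats(0, 0))

with ThreadPoolExecutor() as pool:
    # The two sub-expressions are pure, so they may run in parallel.
    total = pool.submit(total_of, samples)
    count = pool.submit(count_of, samples)
    cell.receive(Stats(total.result(), count.result()))

try:
    cell.value.total = 99          # partial modification is rejected
    partial_update_allowed = True
except FrozenInstanceError:
    partial_update_allowed = False
```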

In  PROOF,  an  object  can  simultaneously  participate  in  more  than  one  function 
evaluation.  This  implies  that  there  can  be  more  than  one  attempt  to  modify  the  same 
object  simultaneously.  Simultaneous  modification  of  objects  will  result  in  inconsistent 
and  incorrect  states  of  objects.  It  is  imperative  that  at  any  moment  an  object  can  be 
the  recipient  of  only  one  of  the  function  evaluations  in  which  the  object  is  involved. 
Simultaneous  modification  to  the  same  object  must  be  serialized.  At  any  moment, 
the  status  of  any  object  involved  in  an  expression  falls  into  one  of  the  following  three 
categories: 

read-only  :  The expression only needs to read the value of the object.

will-modify  :  The  expression  will  modify  the  object,  but  the  modification  does  not 
occur  at  this  moment. 

modifying  :  The  expression  is  currently  modifying  the  object. 

In order to ensure the consistency and correctness of objects, a multi-mode locking
mechanism is adopted. There are three different types of locks, R-Lock, W-Lock and
M-Lock, associated with the three statuses of the object, read-only, will-modify
and modifying, respectively. A lock is granted only when it is compatible with the
other locks already granted for the same object, according to the compatibility chart
in Table 1.
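
The granting rule can be written down directly from Table 1. The sketch below is illustrative: the enum `Lock` and the functions `compatible` and `can_grant` are assumed names, and only the compatibility entries themselves come from the table.

```python
from enum import Enum
from typing import Iterable


class Lock(Enum):
    R = "R-Lock"   # read-only
    W = "W-Lock"   # will-modify
    M = "M-Lock"   # modifying


# Compatible pairs copied from Table 1; every other pair is incompatible.
# frozenset({Lock.R}) stands for the pair (R-Lock, R-Lock).
_COMPATIBLE = {
    frozenset({Lock.R}),
    frozenset({Lock.R, Lock.W}),
}


def compatible(a: Lock, b: Lock) -> bool:
    """True if locks a and b may be held on the same object together."""
    return frozenset({a, b}) in _COMPATIBLE


def can_grant(requested: Lock, held: Iterable[Lock]) -> bool:
    """A lock is granted only if it is compatible with every held lock."""
    return all(compatible(requested, h) for h in held)
```

For example, `can_grant(Lock.R, [Lock.R, Lock.W])` holds because readers coexist with each other and with one pending writer, while `can_grant(Lock.W, [Lock.W])` does not, which is how simultaneous modifications of the same object end up serialized.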

The programming language based on PROOF is called PROOF/L [1]. A PROOF/L
program consists of a set of objects whose methods are written in the functional
paradigm. Writing programs in PROOF/L liberates the programmer from the burden of
dealing with synchronization, parallelization and communication while programming,
and also makes the design of the software system independent of the architecture of
the parallel processing system [3, 4] on which the software is to be implemented.
These tasks are instead performed by the translator, which translates the PROOF/L
code into a target language. This makes software written in PROOF/L portable and
easy to develop. The concrete syntax of PROOF/L is given in Section 5.1.
Chapter 3

Overall  Framework 


As mentioned before, our approach to software development for parallel processing
systems is based on the computation model PROOF [1, 2], which incorporates the
functional paradigm into the object-oriented paradigm. The object-oriented paradigm
reflects the parallel structure of the problem space and is suitable for representing
inherently concurrent behavior, and the functional paradigm allows us to exploit the
parallelism in each object. Our framework consists of the following phases:
object-oriented analysis, object design, verification, coding, and transformation from
PROOF/L to a target language of a parallel processing system, as shown in Figure 3.1.
In this chapter, we give an overview of our framework.


3.1  Object-oriented  Analysis 


In object-oriented software development, there is no clear distinction between the
software requirement analysis phase and the system design phase, since the elements
of interest in each phase are the same: the objects in the system. Thus, we do
not distinguish between the two phases, and we call the first phase of our
framework object-oriented analysis. The object-oriented analysis^1 phase consists
of the following steps.

1)  Identify  objects  and  classes. 

2)  Determine  class  interfaces. 

3)  Specify  dependency  and  communication  relationships  among  objects. 

4)  Identify  active,  passive  and  pseudo-active  objects. 

5)  Identify  the  shared  objects. 

6)  Specify  the  behavior  of  each  of  the  objects. 

7)  Identify bottleneck objects, if any.

8)  Check the completeness and consistency.

^1 We also call the object-oriented analysis in our framework decomposition because the
system is decomposed into a set of objects in this phase.

Figure 3.1: The various phases of our framework. The last phase produces executable
code in the target language.


3.1.1  Identifying  Objects  and  Classes 

In this step, the software system is represented by a set of communicating objects.
Objects are identified by analyzing the semantic contents of the requirement
specifications. All physical and logical entities are recognized. Each object
corresponds to a real-world entity, such as a sensor, a control device, data or an
action. Objects having common behavior are grouped together to form a class hierarchy.
The identification of objects currently relies largely on the intuition of the
developer. One strategy for identifying objects is to examine the specification written
in natural language: the nouns in the specification are candidates for the objects and
the verbs for the operations [21]. Another strategy is to draw the dataflow diagrams
first and then detect candidate objects from these diagrams [22]. In the dataflow
diagrams, the function names are represented in the format action.object, i.e.,
the first part of the function name denotes the action and the second part denotes
the object. Other techniques for identifying objects are discussed in [21]. These
techniques can be used as guidelines. However, the experience and intuition of the
programmer play an important role in identifying the objects from the requirement
specifications.


3.1.2  Determining  Class  Interfaces 

In this step, object class interfaces are determined. Because every object is considered
an instance of an object class, the object classes to which objects belong must be
defined instead of defining the objects directly. The interface of an object class
consists of the specifications of the methods provided by the class. For each method,
the specification consists of the input and output parameters and their types. The
actual definitions of the methods are hidden and will be given at a later stage. The
class interface definition in PROOF is slightly different from that in the conventional
object-oriented approach. Let $m$ be a method of class $C$. In the conventional
object-oriented approach, the specification of $m$ may appear as follows:
$m : I \mapsto O$, where $I$ is the input parameter(s) of $m$ and $O$ the output
parameter(s) of $m$. Typically, $m$ will also have side-effects on the internal states
$S$ of the instances of $C$. In PROOF, the methods are defined as applicative
functions. Therefore, no side-effects will occur. The internal state of an object will
be an explicit input and/or output parameter of $m$ if the internal state is accessed
and/or modified. Typically, in PROOF the interface of $m$ appears as follows:

$m : I \times S \mapsto O \times S$.

The  methods  will  not  directly  modify  the  state  of  the  objects.  Instead,  a  new  state 
of  an  object  will  be  returned  when  the  object  needs  to  be  modified.  The  modification 
of  objects  will  be  achieved  by  a  special  construct  discussed  later. 
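
A small Python sketch may make the interface shape $I \times S \mapsto O \times S$ concrete. The `Account` state and the `withdraw` method are hypothetical examples, not taken from PROOF/L. The method receives the current state explicitly and returns the output together with a new state, leaving the state it was given untouched.

```python
from typing import NamedTuple, Tuple


class Account(NamedTuple):
    """The internal state S, modeled as an immutable record."""
    balance: int


def withdraw(amount: int, state: Account) -> Tuple[bool, Account]:
    """A method of shape I x S -> O x S with no side-effects.

    I is the amount, O is a success flag, S is the account state.
    """
    if amount > state.balance:
        return False, state                       # state unchanged
    return True, Account(state.balance - amount)  # new state returned


before = Account(100)
ok, after = withdraw(30, before)
refused, unchanged = withdraw(500, after)
```

Since `before` is never altered, any other expression still reading the old state sees a consistent value; installing `after` as the object's new state is a separate step.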


3.1.3  Specifying  Dependency  and  Communication  Relationships  Among 
Objects 

In this step, the static relationships among objects are specified using object
communication diagrams. The identity of the objects, the methods in each object and
the relationships among them are specified to capture the features of the real-world
problem which are important to the software developer. In the object communication
diagram, the objects are represented as rectangles. The links between the objects
indicate communication between objects, i.e., method invocation. The arrows on
the links indicate the directions of invocation. The methods defined in an object as
its interface are written within the object, with the method names beginning with a
period. The labels on the arrows show which methods are being invoked by an object.
Figure 3.2 shows an example of an object communication diagram, where the objects
radar, base and safe_place interact with each other by invoking their methods. Object
radar invokes the method rad_val defined in object base, and base invokes the method
escape defined in object safe_place. A more elaborate object communication diagram
will be given when we present an example to illustrate our approach in Chapter 8.

Figure 3.2: An object communication diagram.


3.1.4  Identifying Active, Passive and Pseudo-Active Objects

In this step, the objects are classified according to their invocation properties as
active, passive or pseudo-active. An active object can initiate the activation of other
objects by invoking their methods. The methods defined in an active object cannot be
invoked by other objects, but they can be invoked by other methods defined within the
active object itself. A passive object is activated only when its methods are invoked by
other objects. Pseudo-active objects behave in a manner between active and passive
objects: they can invoke the methods of other passive or pseudo-active objects, and
they also have methods which can be invoked by other active or pseudo-active objects.
Since active objects are activated when the software system is started, all the threads
of control in the application start from the active objects. Identifying all the threads
is very important in real-time process control systems because we need this information
to check the completeness and consistency of the decomposition. We can easily
identify the active, passive and pseudo-active objects from the object communication
diagram. The active objects have only outgoing arrows, the passive objects have
only incoming arrows, and the pseudo-active objects have both incoming and outgoing
arrows. Classifying objects by their invocation behavior helps us build the static
structure of the software system.
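
The arrow rule above is mechanical enough to sketch in a few lines of Python. The function `classify` and the edge-list representation of the diagram are assumptions made for illustration; the example edges follow the radar, base and safe-place example of Figure 3.2, with each arrow written as a (caller, callee) pair.

```python
from typing import Dict, Iterable, Tuple


def classify(objects: Iterable[str],
             invocations: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Classify objects from the arrows of an object communication diagram.

    Each invocation is a (caller, callee) pair. Self-invocations are
    ignored, since an object's methods may call each other internally.
    """
    edges = [(src, dst) for src, dst in invocations if src != dst]
    callers = {src for src, _ in edges}   # objects with outgoing arrows
    callees = {dst for _, dst in edges}   # objects with incoming arrows

    kinds = {}
    for name in objects:
        if name in callers and name in callees:
            kinds[name] = "pseudo-active"
        elif name in callers:
            kinds[name] = "active"
        elif name in callees:
            kinds[name] = "passive"
        else:
            kinds[name] = "isolated"      # no arrows at all
    return kinds


diagram = [("radar", "base"), ("base", "safe_place")]
kinds = classify(["radar", "base", "safe_place"], diagram)
```

An object reported as isolated has no arrows at all, which is worth flagging when the completeness of the decomposition is checked.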

3.1.5  Identifying  Shared  Objects 

In  this  step,  once  the  static  structure  of  the  software  system  is  determined,  we  identify 
the  shared  objects  from  them.  An  object  is  a  shared  object  if  it  has  local  data  which  can 
be  accessed  by  a  number  of  objects.  The  shared  objects  can  be  further  divided  into  two 
classes of objects: read-only shared object and writable shared object. The read-only
object has local data which cannot be modified by other objects. The writable object
has local data which can be modified by other objects. Read-only objects can be freely
duplicated as many times as desired. However, writable objects cannot be duplicated
easily because maintaining the consistency of the data will then become an overhead.
All accesses to the data in the writable shared objects need to be synchronized
to keep the data consistent. An active object cannot be a shared
object since, by definition, no other object can invoke the methods in an active object.
Shared writable objects could become bottleneck objects as they may have to be
executed sequentially to maintain the consistency of the data. Such bottleneck objects
are often shared components requiring synchronization among the objects accessing them
concurrently.  Identifying  such  bottleneck  objects  from  the  decomposition  and  refining 
the  decomposition  to  reduce  or  remove  some  of  such  unnecessary  bottleneck  objects 
play  an  important  role  in  enhancing  the  parallelism. 
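As a rough illustration of this classification, the following Python sketch labels shared objects from a list of access records. The record format (accessor, target, mode), the object names, and the reading of "a number of objects" as "two or more distinct objects" are all assumptions made for the example.

```python
def classify_shared(accesses):
    """Return {object: "read-only" | "writable"} for the shared objects.

    `accesses` holds (accessor, target, mode) records with mode "read" or
    "write". An object is treated as shared when at least two distinct
    objects access its local data (an assumption of this sketch).
    """
    info = {}
    for accessor, target, mode in accesses:
        entry = info.setdefault(target, {"accessors": set(), "written": False})
        entry["accessors"].add(accessor)
        if mode == "write":
            entry["written"] = True

    shared = {}
    for target, entry in info.items():
        if len(entry["accessors"]) < 2:
            continue  # accessed by a single object only: not shared
        shared[target] = "writable" if entry["written"] else "read-only"
    return shared


# Hypothetical access records, invented for illustration.
shared = classify_shared([
    ("radar", "map", "read"),
    ("base", "map", "read"),
    ("radar", "tracks", "write"),
    ("base", "tracks", "read"),
    ("base", "log", "write"),
])
```

Here map is a read-only shared object that may be duplicated, tracks is a writable shared object and hence a bottleneck candidate, and log is not shared at all.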


3.1.6  Specifying  the  Behavior  of  Objects 

In  this  step,  the  behavior  of  each  object  is  specified.  The  object  communication 
diagram  obtained  in  Step  3)  only  describes  the  static  structure  and  relationships  of 
the  objects  in  the  problem  domain.  It  does  not  provide  any  information  regarding  the 
behavior of the software system to be developed. That is, the control aspect of the
software  system  is  not  specified  in  the  object  communication  diagram.  However,  to 
verify  and  analyze  the  decomposition,  we  need  to  define  the  behavior  of  each  object. 
For  this  purpose,  we  use  the  notations  similar  to  those  in  [23]: 

• SEQuential execution of methods: When the methods m_1, m_2, ..., m_n are
executed sequentially in the order m_1, m_2, ..., m_n, the behavior is specified as:

SEQ(m_1, m_2, ..., m_n)

• CONcurrent execution of methods: When the methods m_1, m_2, ..., m_n are
executed concurrently, the behavior is specified as:

CON(m_1, m_2, ..., m_n)

•  WAIT  for  method  invocation:  When  an  object  is  waiting  for  the  invocation  of 
its  method  m  by  another  object  O  to  proceed  with  its  execution,  its  behavior  is 
specified  as: 


WAIT(m, O)

• SELect a method for execution based on a condition: The SEL construct behaves like
the CASE statement in ordinary programming languages. The SEL construct
selects one of the methods based on a condition. When an object selects one of
the methods m_1, m_2, ..., m_n for execution based on a condition
C, its behavior is specified as:

SEL(C; m_1, m_2, ..., m_n)

• ONE-OF the methods for execution from a group of possible methods: The ONE-OF
construct is used in cases where different objects could try to invoke the methods
defined in the object O simultaneously, but O permits only one object
to invoke its method at a time. That is, this construct serializes the requests,
and it is typically used to describe the behavior of shared writable objects. Note
the difference between the SEL and the ONE-OF constructs. Among the set of
methods m_1, ..., m_n defined in an object, when the object permits only one of its
methods to be invoked by other objects, the behavior of the object is specified as:

ONE-OF(WAIT(m_1, O_j), ..., WAIT(m_n, O_k))

ONE-OF will always be associated with a WAIT construct in a shared writable
object because the object O will have to wait for other objects to invoke its
methods.
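One way to make the five constructs above machine-checkable is to encode them as small immutable data types. The Python sketch below is an assumed encoding, not the notation of [23] or of PROOF/L; the class names, field names and the render helper are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple, Union

# A behavior term is either a method name or one of the constructs below.
Behavior = Union[str, "Seq", "Con", "Wait", "Sel", "OneOf"]


@dataclass(frozen=True)
class Seq:
    items: Tuple[Behavior, ...]   # executed one after another


@dataclass(frozen=True)
class Con:
    items: Tuple[Behavior, ...]   # executed concurrently


@dataclass(frozen=True)
class Wait:
    method: str                   # method waiting to be invoked
    caller: str                   # object expected to invoke it


@dataclass(frozen=True)
class Sel:
    condition: str                # condition C
    items: Tuple[Behavior, ...]   # alternatives chosen by C


@dataclass(frozen=True)
class OneOf:
    waits: Tuple[Wait, ...]       # requests served one at a time


def render(b):
    """Print a behavior term in the notation used in the text."""
    if isinstance(b, str):
        return b
    if isinstance(b, Wait):
        return f"WAIT({b.method}, {b.caller})"
    if isinstance(b, Sel):
        return f"SEL({b.condition}; " + ", ".join(render(x) for x in b.items) + ")"
    if isinstance(b, OneOf):
        return "ONE-OF(" + ", ".join(render(w) for w in b.waits) + ")"
    if isinstance(b, (Seq, Con)):
        name = "SEQ" if isinstance(b, Seq) else "CON"
        return name + "(" + ", ".join(render(x) for x in b.items) + ")"
    raise TypeError(f"not a behavior term: {b!r}")


# Behavior of a hypothetical shared writable object.
table = OneOf((Wait("read", "O1"), Wait("update", "O2")))
```

Keeping the terms immutable makes it straightforward to traverse them later, for instance when searching for bottleneck objects.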


3.1.7  Identifying  Bottleneck  Objects 

In  this  step,  the  bottleneck  object  which  may  unnecessarily  degrade  the  performance 
of  the  software  system  is  identified.  Usually,  a  bottleneck  object  will  be  a  shared 
writable  object.  One  can  identify  a  shared  writable  object  from  the  description  of  the 
object  behavior  in  the  above  step.  If  the  behavior  of  an  object  has  a  construct  of  the 
type ONE-OF(WAIT(...), WAIT(...), ...), then this object is also a bottleneck
object. Such objects limit the parallelism in the software system. If such an object
is found, then redo or refine the object-oriented analysis to reduce the bottleneck if
possible. This step may increase the number of objects in the software system. Repeat
Steps 2) to 6) until the object-oriented analysis is found satisfactory.
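The pattern test described in this step can be sketched as a recursive search. In the Python sketch below, behaviors are written as nested tuples whose first element is the construct name; this encoding, the function name is_bottleneck and the example objects are assumptions made for illustration.

```python
def is_bottleneck(behavior):
    """True if the behavior contains ONE-OF(WAIT(...), WAIT(...), ...).

    Behaviors are nested tuples such as ("SEQ", "m1", ("WAIT", "m", "O1"));
    plain strings stand for method or object names (assumed encoding).
    """
    if not isinstance(behavior, tuple) or not behavior:
        return False
    op, *args = behavior
    if op == "ONE-OF" and args and all(
        isinstance(a, tuple) and a and a[0] == "WAIT" for a in args
    ):
        return True
    # Otherwise keep searching inside nested constructs.
    return any(is_bottleneck(a) for a in args)


# Hypothetical object behaviors.
behaviors = {
    "table": ("SEQ",
              ("ONE-OF", ("WAIT", "read", "O1"), ("WAIT", "write", "O2")),
              "log"),
    "sensor": ("SEQ", "sample", "send"),
}
bottlenecks = [name for name, b in behaviors.items() if is_bottleneck(b)]
```

Objects reported by such a check are the candidates for refining the decomposition.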

3.1.8  Checking  For  Completeness  and  Consistency 

In this step, the result of the object-oriented analysis is verified against the user
requirements. From the given user requirements, the possible threads of control are
identified,  and  each  of  them  is  examined  using  the  behavior  of  the  objects  specified 
in  Step  5).  The  first  activity  in  any  control  thread  must  begin  in  one  of  the  active 
objects.  The  sequence  of  activities  in  each  control  thread  must  be  reachable  by  tracing 
the  behavior  of  the  objects.  If  there  is  any  control  thread  that  cannot  be  followed, 
the  decomposition  is  incorrect  and  the  decomposition  steps  need  to  be  reviewed.  The 
consistency  among  objects  is  verified  by  examining  whether  input  parameters  of  the 
methods being called are defined as local variables in the calling object and the output
parameters of the methods being called are defined as local variables in the called
object. 
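A simplified version of the thread check can be sketched as follows. The sketch only tests two things: that a thread starts in an active object, and that every hop between objects follows an arrow of the object communication diagram. It does not trace the full SEQ/CON/SEL behavior of each object, and the data layout and the method name scan are assumptions.

```python
def check_thread(thread, active_objects, arrows):
    """Return a list of problems found in one control thread.

    `thread` is a list of (object, method) activities and `arrows` is a set
    of (caller, callee) object pairs (assumed encodings).
    """
    if not thread:
        return ["empty thread"]
    problems = []
    first_obj, _ = thread[0]
    if first_obj not in active_objects:
        problems.append(f"thread starts in non-active object {first_obj}")
    for (src, _), (dst, method) in zip(thread, thread[1:]):
        # A step inside the same object is always allowed; a step to
        # another object needs a matching arrow.
        if src != dst and (src, dst) not in arrows:
            problems.append(f"{src} cannot reach {dst}.{method}")
    return problems


arrows = {("radar", "base"), ("base", "safe-place")}
ok = check_thread(
    [("radar", "scan"), ("base", "rad-val"), ("safe-place", "escape")],
    active_objects={"radar"},
    arrows=arrows,
)
```

An empty result means the thread can be followed; any reported problem indicates that the decomposition should be reviewed.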




3.2  Object  Design 


In our approach, the object design is specified using the notations defined in PROOF/L
[1]. The class interface definitions and information about the object behavior are used
to  design  the  objects.  Our  approach  to  object  design  involves  three  steps: 

1)  Establish  the  class  hierarchy. 

2)  Design  the  class  composition  and  the  methods  in  each  object. 

3)  Design  the  bodies  of  the  active  and  pseudo-active  objects. 


3.2.1  Establishing  Class  Hierarchy 

Since some common operations and/or attributes between the objects may not be
apparent in the analysis phase, different objects are reexamined to identify the
commonality among the classes in the design phase. A set of operations and/or attributes
that  are  common  to  more  than  one  class  can  be  abstracted  and  implemented  in  a 
common  class  called  the  superclass.  The  subclasses  then  have  only  the  specialized 
features.  In  some  cases,  a  superclass  can  be  extracted  from  a  single  subclass  and  put 
in the class library if needed. Establishing a class hierarchy in the form of
superclasses and subclasses increases the use of inheritance in the application. Class hierarchy
also  enhances  the  modularity  and  the  extensibility  of  the  software  system  [24]. 


3.2.2  Designing  Class  Composition  and  Methods 

In  this  step,  the  composition  and  the  methods  for  each  object  class  are  designed.  The 
class  definition  consists  of  composition  and  methods.  The  composition  defines  the 
internal  data  structure  of  the  class.  Various  constructors,  such  as  list  and  Cartesian 
product,  are  provided.  A  typical  functional  style  is  adopted  in  the  method  definition. 
A rich set of functional forms, i.e. higher-order functions, as well as primitive
functions are predefined. In the method design, the internal state of the object to which
the method belongs is included as both an input and an output parameter so that
side-effects can be avoided. A method of an object consists of an optional guard and
an expression. The guard is a predicate specifying synchronization constraints and
the expression specifies the behavior of the method. The synchronization
among concurrent objects is achieved by the guards attached to the methods. The
guard attached to a method is defined in such a way that it depends only on the status of
the local data, and not on the definition of other methods. Therefore, the
guards are directly inheritable with the methods. The expression is a purely applicative
function and is specified informally in a natural language. Due to the referential
transparency of applicative functions, fine grain parallelism can be exploited. The
method  execution  can  be  done  as  follows: 




1.  Evaluate  the  guard  associated  with  the  method.  If  the  method  does  not  have  a 
guard, then we assume that it has a guard that always evaluates to true.

2.  If  the  guard  is  true,  then  execute  the  expression;  otherwise,  go  back  to  (1). 

When there are simultaneous attempts to access the same object through
invocation of its methods, the selection of one method for execution is done
non-deterministically.
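The guard-then-expression scheme can be illustrated with a small Python sketch. It is not PROOF/L code: the Method type, the try_invoke helper and the bounded-buffer example are assumptions. The object state is passed in and returned explicitly, as the text prescribes, and a single call performs one pass of steps 1 and 2 instead of looping, so that the caller decides when to retry.

```python
from typing import Callable, NamedTuple, Optional


class Method(NamedTuple):
    expression: Callable                 # (state, *args) -> (new_state, result)
    guard: Optional[Callable] = None     # state -> bool; None means always true


def try_invoke(method, state, *args):
    """One pass of steps 1 and 2; returns (fired, new_state, result)."""
    guard = method.guard if method.guard is not None else (lambda s: True)
    if not guard(state):
        return False, state, None        # the caller would go back to step 1
    new_state, result = method.expression(state, *args)
    return True, new_state, result


# Hypothetical bounded buffer whose state is an immutable tuple.
CAPACITY = 2
put = Method(expression=lambda buf, x: (buf + (x,), None),
             guard=lambda buf: len(buf) < CAPACITY)
get = Method(expression=lambda buf: (buf[1:], buf[0]),
             guard=lambda buf: len(buf) > 0)

fired, state, _ = try_invoke(put, (), "a")
```

Because both guards look only at the local state (the buffer contents), they do not depend on how other methods are defined, which matches the inheritance argument made above.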

It  is  desirable  to  refine  the  methods  that  access  the  shared  objects.  For  example, 
let object O_1 invoke a method m defined in the shared object O_2. Now suppose that
the method m needs to read data, perform some computation based on the data
and then modify the local data of O_2. Then the guard of the method m needs to be
evaluated before the execution of the method m begins. The activities, reading and
computing, performed on O_2 can be executed in parallel when another object invokes
this  method  because  those  operations  do  not  involve  any  shared  data.  However,  these 
activities  cannot  be  invoked  by  another  object  in  parallel  if  the  method  m  contains 
these  activities  as  part  of  its  code.  Thus,  the  method  should  be  refined  into  smaller 
methods  in  such  a  way  that  the  guard  can  affect  the  execution  of  a  short  segment 
of  code  only.  This  refinement  of  method  is  similar  to  the  refinement  of  the  object  to 
reduce  the  bottleneck  in  the  object-oriented  analysis  stage. 

Selection of algorithm and data structure is an important part of the method
design. The selection of algorithms to accomplish a specific task should be based on
certain criteria which satisfy the required constraints such as accuracy, timing
requirements, use of common utilities across the design, reuse of previously developed
software, computational complexity, flexibility, ease of implementation,
understandability, etc.

The  underlying  architecture  of  the  machine  will  not  be  an  influencing  factor  if  the 
algorithm is to be executed on a single processor. On the other hand, if the
algorithm is to be executed on a configuration of parallel processors, then the algorithm
selection decision will usually be influenced by the underlying architecture on which
the algorithm is to be executed. In the case of reconfigurable architectures, the
algorithm selection can be based on the performance requirements and the designer will
not  need  to  be  concerned  with  the  configuration  of  the  computer.  The  computer  can 
then  be  reconfigured  to  reflect  the  structure  of  the  algorithm.  The  designer  of  the 
algorithm  to  be  executed  on  more  than  one  processor  on  a  parallel  processing  system 
should  be  aware  of  the  configurations  that  are  available  on  the  system,  such  as  the 
maximum  number  of  immediate  neighbors  that  a  processor  could  have  in  the  system 
architecture,  etc.  For  example,  a  transputer  could  have  a  maximum  of  four  immediate 
neighbors;  and  a  hypercube  could  have  a  number  of  immediate  neighbors  depending 
on  the  dimension  of  the  hypercube.  While  designing  the  algorithms,  new  classes  of 
objects  may  be  defined  to  make  the  implementation  more  efficient.  These  are  low 
level objects and are not usually visible externally. New classes called internal classes
may also be defined at this stage for the purpose of implementation, but they are not
reflected in the user requirement.

3.2.3 Designing the Bodies for Active and Pseudo-Active Objects

In  this  step,  a  body  is  associated  with  each  active  and  pseudo-active  object.  There  is 
no  body  associated  with  a  passive  object  as  it  does  not  invoke  any  methods.  The  role  of 
a  body  is  to  invoke  a  method  and  to  modify  the  state  of  the  objects  represented  by  their 
local data. The body in each object is expressed in the form e_1 // e_2 // ... // e_k, where
each e_i is an expression representing method invocations, and expressions separated
by // are evaluated simultaneously. // is a parallel construct indicating parallel
execution. e_i can be recursively defined and can be diverse. Thus, the evaluation
process may be infinite. In e_i, methods of objects can be invoked, and the states of
objects may be modified. The modification of an object is expressed by the reception
construct which has the form R[|O|]e, where O, called a recipient object, is an object
name and e is an expression with applications of purely applicative functions only.
The  reception  construct  can  occur  only  in  the  bodies  of  active  and  pseudo-active 
objects.  The  reception  construct  indicates  that  the  object  O  will  receive  the  value 
returned  as  a  result  of  evaluating  the  expression  e.  This  construct  modifies  the  states 
of  the  object.  It  differs  from  the  conventional  assignment  in  the  following  aspects: 

1. The expression e contains only purely applicative functions. Thus, the evaluation
of the expression e is side-effect free and can be parallelized.

2. The expression e must return a new state of the object O. O receives the new
state as a whole entity. Therefore, no partial modification and no inconsistent
state of the object is possible.

The object O may be composed of other objects. However, the composing objects
cannot  appear  in  the  reception  construct  as  a  recipient.  The  overhead  involved  in 
complying  with  this  no  partial  modification  rule  can  be  minimized  by  optimization 
based  on  static  data  dependency  analysis. 
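The whole-state semantics of the reception construct can be mimicked in Python with immutable values. The sketch below is only an analogy under stated assumptions: the Counter class, the increment function and the receive helper are hypothetical names, and a dictionary stands in for the set of named objects.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Counter:
    count: int
    history: tuple


def increment(state, step):
    """Purely applicative: builds a new state and never mutates the old one."""
    return replace(state, count=state.count + step,
                   history=state.history + (step,))


def receive(objects, name, new_state):
    """Sketch of the reception construct: object `name` receives `new_state`
    as a whole entity; the previous collection of objects is left untouched."""
    updated = dict(objects)
    updated[name] = new_state
    return updated


objects = {"c": Counter(count=0, history=())}
after = receive(objects, "c", increment(objects["c"], 5))
```

Since the frozen dataclass rejects field assignment, there is no way to leave the object in a partially modified state, which is the property the second point above asks for.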

The body of an object can be derived using the class interface and the object
behavior obtained from the analysis stage. We also need to introduce the modification
operator R in the body of the objects that are modified. The objects that are modified
can be determined from the method definitions given in the class interface. Consider
an object O_1 defined as the output of a method m. Whenever an object O_2 invokes
the method m defined in O_1, O_2 will be modified. Thus, in the body of O_2, when m is
invoked, the modification operator R is substituted in the place of the method
invocation.

The object behavior is specified using the control constructs, SEQ, CON, SEL and
ONE-OF, and statements including method names and WAIT clauses. Transforming
the control constructs into an equivalent body is straightforward. The SEQ, CON and
SEL constructs are transformed into sequential, concurrent, and if-then-else types of
constructs respectively in the body. These constructs appear in the active, passive
and pseudo-active objects. The WAIT construct appears only in the passive and
pseudo-active objects, and not in the active objects as the active objects do not wait for other
objects to invoke their methods. The ONE-OF construct appears only in the passive
objects or in the passive part of the pseudo-active objects. A WAIT construct is usually
associated with the ONE-OF construct.

When  a  WAIT  construct  is  encountered  in  the  object  behavior,  then  a  guard  has 
to  be  associated  with  the  method  which  is  WAITing  to  be  invoked  by  other  objects. 
The guard value will be set/reset by the interacting objects. Alternatively, a guard can
also be introduced into the body of the object which is WAITing. However, since
the computation model PROOF supports only the former, we will use only the
former strategy. The latter method has the advantage that the body design follows
directly from the object behavior, while in the former, the methods will have to be
designed only after taking the WAIT constructs into consideration. The design of
the method will then become dependent on the object behavior, introducing an extra
level of complexity which can be avoided when a guard is introduced at the body level.

The  behavior  of  a  purely  passive  object  begins  with  a  WAIT  clause  or  a  ONE-OF 
clause.  The  guard  mechanism  in  this  case  can  be  used  to  enforce  mutual  exclusion 
if the object is a shared writable object. A pseudo-active object should also begin
with  a  WAIT  clause.  The  method  WAITing  to  be  invoked  should  then  be  redesigned 
to  include  a  guard  which  will  be  set  only  by  the  invoking  object  and  will  be  reset 
by  the  called  object.  In  such  a  case,  the  invoking  object  will  then  clear  the  way  for 
the pseudo-active object to start its thread of operation. The guard will have to be
checked by the pseudo-active object, which continues further execution only when the
guard evaluates to True. For this purpose, we can add an additional method in the
class  of  the  pseudo-active  object  which  will  only  evaluate  this  guard.  This  step  is 
further  illustrated  in  the  example,  discussed  later,  when  designing  the  body  of  the 
pseudo-active  object.  If  the  analysis  was  such  that  a  pseudo-active  object  starts  its 
thread in the beginning and then encounters a WAIT state, then we can break up
this  object  into  an  active  object  and  a  passive  object,  with  an  object  communication 
link  between  these  two  objects. 


3.3  Verification 


The  design  of  the  objects  done  in  the  previous  phase  has  to  be  verified  and  analyzed 
for various liveness and safeness properties. For this purpose, we transform our design
into Petri nets [25]. Petri nets have been selected in our approach mainly because our
design can be easily represented in the Petri-net model and because many techniques
have been developed to analyze Petri-net models for various liveness and safeness
properties [26]-[32]. An extensive survey of Petri nets and their applications is given
in  [33],  and  an  overview  of  existing  tools  is  given  in  [34]. 

Petri nets can be used to model both the static and dynamic properties of the
systems. Static properties of the systems are represented by the graph part of a Petri
net, while the dynamic properties of the system can be determined from the Petri-net
graph, the initial marking and the firing rules. The advantages of modeling a
dynamic system with Petri nets are: their graphical and precise
representation; analysis tools to determine and verify the dynamic behavior of the
system from its structure; and the capability to design the system using top-down
and bottom-up approaches.

The  transformation  of  our  design  to  Petri  nets  consists  of  the  following  three  steps: 

1)  Transformation  of  bodies  to  Petri  nets. 

2)  Composition  of  the  Petri  nets. 

3)  Refinement  of  the  Petri  nets. 


3.3.1  Transformation  of  bodies  to  Petri  nets 

To  transform  the  bodies  designed  into  Petri  nets,  we  use  places  as  the  token  holder 
for  the  control  flow,  transitions  as  the  methods,  and  the  arcs  between  places  and 
transitions  as  the  control  flows.  Since  a  body  is  represented  as  a  statement  consisting 
of  control  constructs  and  method  names,  we  show  the  transformation  for  each  of 
control  constructs,  viz.,  CON,  SEQ,  SEL  and  ONE-OF  in  Figure  3.3.  Expressions  in 
the body of an active object could have methods that do not require modifications
and/or methods that require modification by using the construct R. The method
requiring modification needs to be executed serially to maintain the consistency of
the object state. These two kinds of methods are represented differently in the Petri-net
representation after the transformation. For this purpose, an additional place
called the bottleneck place is associated with a method requiring modification. The bottleneck place
serializes the execution of the methods. Figure 3.4 shows the transformation of such
a  method.  The  bottleneck  place  will  also  be  used  to  compose  the  Petri  nets  in  the 
next  step. 
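To make the mapping concrete, the following Python sketch builds a small place/transition net for a SEQ body. It is an illustration under assumptions: the PetriNet class, the place names p0, p1, ... and "bottleneck", and the choice to let a modifying method take and return one token of the bottleneck place are guesses at the idea behind Figures 3.3 and 3.4, which are not reproduced here.

```python
class PetriNet:
    """Minimal place/transition net with unit arc weights."""

    def __init__(self):
        self.marking = {}        # place -> number of tokens
        self.transitions = {}    # name -> (input places, output places)

    def add_place(self, place, tokens=0):
        self.marking[place] = self.marking.get(place, 0) + tokens

    def add_transition(self, name, inputs, outputs):
        for place in (*inputs, *outputs):
            self.add_place(place)            # make sure every place exists
        self.transitions[name] = (tuple(inputs), tuple(outputs))

    def enabled(self, name):
        inputs, _ = self.transitions[name]
        return all(self.marking[p] >= 1 for p in inputs)

    def fire(self, name):
        if not self.enabled(name):
            raise ValueError(f"transition {name} is not enabled")
        inputs, outputs = self.transitions[name]
        for p in inputs:
            self.marking[p] -= 1
        for p in outputs:
            self.marking[p] += 1


def seq_to_net(methods, modifying=frozenset()):
    """SEQ(m1, ..., mn): places p0..pn chain the transitions m1..mn.

    Methods in `modifying` additionally take and return the single token of
    a bottleneck place, so they cannot overlap with another modifier.
    """
    net = PetriNet()
    net.add_place("p0", tokens=1)            # control token of the thread
    if any(m in modifying for m in methods):
        net.add_place("bottleneck", tokens=1)
    for i, m in enumerate(methods):
        inputs, outputs = [f"p{i}"], [f"p{i + 1}"]
        if m in modifying:
            inputs.append("bottleneck")
            outputs.append("bottleneck")
        net.add_transition(m, inputs, outputs)
    return net


net = seq_to_net(["m1", "m2", "m3"], modifying={"m3"})
```

Initially only m1 is enabled; firing m1, m2 and m3 in order moves the control token to the last place and leaves the bottleneck token available again.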


3.3.2  Composition  of  the  Petri-nets 

The  Petri-net  representations  of  the  bodies  of  the  active  or  pseudo-active  objects  that 
interact  with  each  other  should  be  composed  together  so  that  they  can  be  analyzed 
together.  To  compose  the  nets,  we  identify  the  transitions  or  the  places  that  serve 
as  interaction  points.  The  interaction  among  objects  occurs  only  when  there  is  an 
object modified by other objects. When an object interacts with another object for
accessing the shared writable object, the bottleneck place will be common to both
the objects. Since the bottleneck place is used to serialize the interaction among the
methods requiring modification, it can be used as the fusion point. When the nets
are to be composed, the body of the active object is searched for the methods that
require modification. When such methods are found, two cases arise. For example,
consider two active objects O_a and O_b having the following bodies:

O_a: SEQ(m_1, ..., R[|O_i|] m_n)

O_b: SEQ(m_1', ..., R[|O_j|] m_n')

Figure 3.3: Transformation of the control constructs: (a) SEQ(m_1, m_2, ...), (b) CON(SEQ(m_i, ...),
SEQ(m_j, ...)), (c) SEL(C; SEQ(m_i, m_j, ...), SEQ(l_i, l_j, ...)), (d) ONE-OF(m_1, m_2, m_3).

Figure 3.4: Transformation of a method requiring modification.

When the methods m_n and m_n' are defined in different classes (O_i ≠ O_j), there is no
common bottleneck place between O_a and O_b. Hence, no composition of the nets is
necessary. When the methods m_n and m_n' are defined in the same object (O_i = O_j),
the two bottleneck places associated with the two methods are combined into one. This
process  is  called  fusion  of  places  and  is  illustrated  in  Figure  3.5. 
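Fusion of places can be sketched as a renaming operation on two nets. In the Python sketch below, a net is a plain dictionary with a marking and a transition table; this layout, the function name fuse_places, the place and transition names, and the rule that the fused place keeps a single token are assumptions. The sketch also assumes that the two nets share no other place or transition names.

```python
def fuse_places(net_a, net_b, place_a, place_b, fused_name):
    """Compose two nets by merging place_a (of net_a) with place_b (of net_b)."""

    def rename(net, old):
        marking = {(fused_name if p == old else p): n
                   for p, n in net["marking"].items()}
        transitions = {
            t: ([fused_name if p == old else p for p in ins],
                [fused_name if p == old else p for p in outs])
            for t, (ins, outs) in net["transitions"].items()
        }
        return marking, transitions

    mark_a, trans_a = rename(net_a, place_a)
    mark_b, trans_b = rename(net_b, place_b)
    marking = {**mark_a, **mark_b}
    # Both bottleneck places start with one token; the fused place keeps one.
    marking[fused_name] = max(mark_a[fused_name], mark_b[fused_name])
    return {"marking": marking, "transitions": {**trans_a, **trans_b}}


# Last step of O_a and of O_b, each modifying the same object (O_i = O_j).
net_a = {"marking": {"a0": 1, "a1": 0, "bn_a": 1},
         "transitions": {"a.m_n": (["a0", "bn_a"], ["a1", "bn_a"])}}
net_b = {"marking": {"b0": 1, "b1": 0, "bn_b": 1},
         "transitions": {"b.m_n": (["b0", "bn_b"], ["b1", "bn_b"])}}

fused = fuse_places(net_a, net_b, "bn_a", "bn_b", "bn")
```

After fusion both transitions compete for the single token in the place bn, which is how the composed net expresses the serialization of the two modifying methods.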

3.3.3  Refinement  of  Petri  nets 

The  purpose  of  the  refinement  is  to  replace  a  transition  or  place  by  a  more  complex 
Petri  net  in  order  to  give  a  more  detailed  description  of  the  activity  involved  in  the 
transition  or  place  respectively.  It  is  analogous  to  the  module  concepts  found  in  many 
programming  languages.  At  one  level,  a  simple  abstract  description  of  the  activity 
is given without considering the detailed behavior. At another level, by refining the
nets, a more detailed description of the activities taking place at the transition or
place is specified.

Figure 3.5: Combining two objects with a common bottleneck place.

The transition/place can be refined according to the following rules. Suppose that
a transition/place t_i is replaced with a subnet S.

• The subnet S consists of three parts: input transition/place, a refinement of the net
called block, and output transition/place.

• The incoming arcs of t_i serve as the incoming arcs to the input transition/place.

• The outgoing arcs from t_i serve as the outgoing arcs from the output
transition/place.

•  Only  one  transition  receives  all  the  input  parameters. 

•  Only  one  transition  produces  all  the  output  parameters. 

•  All  the  transitions  except  these  two  input  and  output  transitions  can  only  interact 
with  the  places  and  transitions  defined  within  the  subnet  S. 

•  All  the  places  can  only  interact  with  transitions  defined  within  the  subnet  S. 

For  example,  since  a  method  consists  of  an  expression  with  an  optional  guard,  the 
transitions  may  have  to  be  refined  to  specify  the  guard  and  the  expression.  This  is 
done as follows: Let a method m_i consist of a guard g_i and an expression e_i. Then
the transition for m_i can be refined as follows: Guard evaluation is specified as a
place and there is a transition associated with each result (true or false) of the
guard evaluation. In the case of the True transition, the expression is executed. In the case of
the False transition, control returns to the guard place to evaluate the guard again. The use of
True  and  False  transitions  is  analogous  to  the  method  specified  in  [25]  to  represent 
condition  statement.  This  refinement  process  is  shown  in  Figure  3.6. 

After  the  coding,  when  the  complete  definition  of  the  expressions  for  all  methods 
is  given,  the  refinement  of  the  transitions  representing  all  the  expressions  can  be 
done. Suppose we have an expression e_i: f(g(a,b), h(c,d)). Then, the transition for the
expression e_i may be refined as shown in Figure 3.7.
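The parallelism exposed by such a refinement can be illustrated in Python: since g(a,b) and h(c,d) are independent applicative subexpressions, they can be evaluated concurrently before f combines their results. The bodies of f, g and h below are placeholders invented for the example, and the use of a thread pool is only one possible way to run the two subexpressions side by side.

```python
from concurrent.futures import ThreadPoolExecutor


# Placeholder pure functions; the text does not define f, g or h.
def g(a, b):
    return a + b


def h(c, d):
    return c * d


def f(x, y):
    return x - y


def evaluate(a, b, c, d):
    """Evaluate f(g(a,b), h(c,d)) with g and h submitted concurrently."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        gx = pool.submit(g, a, b)    # independent of h
        hy = pool.submit(h, c, d)    # independent of g
        # f needs both results, so it acts as the join point.
        return f(gx.result(), hy.result())


value = evaluate(1, 2, 3, 4)
```

Referential transparency is what makes this safe: neither g nor h touches shared state, so the result equals that of the sequential evaluation.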


3.4  Coding  in  PROOF/L 

The  design  of  the  software  system  will  be  implemented  by  writing  the  program  in 
PROOF/L. The coding in PROOF/L is rather straightforward and will not be
discussed. The concrete syntax of the PROOF/L language is given in Section 5.1. The
PROOF/L code will have to be translated to the code in a selected target language
that can be executed on the parallel processing system. In our case, we use a Parsytech
board consisting of a network of sixteen T800 transputers, and have developed the
transformation from PROOF/L code to INMOS C code.


3.5  Transformation  from  PROOF/L  to  any  Target  Language 

The transformation of the program written in PROOF/L into a target language
involves the five major steps shown in Figure 3.1: partitioning, front-end translation,
grain size analysis, back-end translation and allocation. In the partitioning step, the
objects  in  the  software  system  are  partitioned  into  a  set  of  clusters.  The  objective  of 
our  partitioning  approach  is  to  improve  the  performance  of  the  software  by  reducing 
communication  cost  among  processors  while  maintaining  the  parallelization  among 
objects. During this phase, we are only concerned with the coarse grain parallelism
among objects. In the front-end translation, PROOF/L code is translated into an
Intermediate Program Representation (IPR). The purpose of this phase is to express
all the parallelism in the PROOF/L code explicitly. As discussed in Section 2.2,
the functional paradigm in PROOF makes it easy to detect all the parallelism. Once the
parallelism is expressed explicitly, grain size analysis is performed to determine the
proper size of tasks to be executed in different processors. For this purpose, IPR
serves as a task precedence graph and a modified intermediate form is generated.
In the back-end translation, the modified intermediate form is translated into the
corresponding equivalent INMOS C code. Then the INMOS C code is allocated to physical
processors. In Chapters 4 through 7, we will present our techniques for partitioning,
front-end translation, grain size determination and back-end translation in detail.

Chapter 4


Partitioning 


In  order  to  distribute  the  modules  to  processors  of  a  parallel  processing  system  so 
that  the  execution  time  of  the  software  can  be  minimized,  we  need  to  partition  the 
software  system  into  a  set  of  modules  and  then  assign  them  to  the  processors.  We  call 
the  first  stage  partitioning  and  the  second  allocation.  Intuitively,  to  exploit  parallel 
processing  power,  the  modules  should  be  executed  in  parallel  as  much  as  possible. 
On  the  other  hand,  to  reduce  the  negative  effect  of  high  communication  overhead  on 
the  system  performance,  the  modules  should  be  distributed  over  as  few  processors  as 
possible. The trade-off between these two conflicting criteria has been well known
in the study of parallel processing systems as well as distributed computing systems.
In parallel processing systems, it is very difficult to achieve linear speedup due to
communication costs among processors, contention for shared resources and the inability
to keep all the processors busy as much as possible [36]. That is one of the reasons
that  there  is  a  large  gap  between  the  ideal  peak  performance  and  the  real  performance 
in  most  parallel  processing  systems. 

The  problem  of  partitioning  for  parallel  and  distributed  computing  systems  has 
been studied extensively. The approaches are divided into three categories:
graph-theoretic [37, 38], integer programming [39, 40] and heuristics [41]-[45]. The
graph-theoretic approach [37] uses an undirected graph to model a software system, and
then minimizes the total interprocess communication cost by performing a min-cut
algorithm on the graph. Its application is limited due to its exponential time complexity
when more than two processors are used. Furthermore, the incorporation of various
constraints into such a model is difficult. The integer programming approach [39, 40]
is based on the implicit enumeration algorithm. It is easy to incorporate additional
constraints into this model in order to satisfy various application requirements. But
the amount of time and memory space required to obtain an optimal solution grows
exponentially with the number of modules of the software system. The above-mentioned
approaches attempt to find partitions with the objective of minimizing the
sum of the communication time and the processing time of the software. In general,
this problem is NP-hard. Thus, heuristic approaches are applied to provide fast and
effective algorithms for a suboptimal solution. Compared with optimal solution
methods, the heuristic methods are faster, more extensible, and simpler. They are
also applicable to large dimensional problems and the problems for which optimal
solutions cannot be obtained in real time.

One of the common assumptions in these approaches is that the execution time
for each module and the communication time among modules are given as input.
In other partitioning approaches [46]-[48], the software is represented as a directed
graph in which each node represents a computation task and each arc between two
nodes represents a precedence relation in terms of data flow or control flow. However,
these approaches cannot be directly applied to the partitioning stage of object-oriented
software development for parallel processing systems because they ignore the
existence of shared data. For instance, in our approach, the software system is
considered to consist of a set of objects where every object can contain shared data
that may be accessed by a number of objects. When the access to the shared data
requires modification of the data, the access must be serialized in order to maintain the
consistency of the data state. When an object containing shared data is simultaneously
accessed by a number of objects and the accesses do not require modifying the shared
data, the parallel invocation of methods in the object should be allowed. Most of
the existing partitioning approaches cannot be used when the software is decomposed
into a set of such objects. In this chapter, we will present a partitioning approach that
overcomes these difficulties.


4.1  Our  Partitioning  Approach 


The objective of our partitioning approach is to improve the overall performance of
the software by reducing communication cost among processors while maintaining the
potential parallelism among objects. The details of our partitioning approach with
illustrative examples have been presented in [49].

The input for our algorithm includes the behavior of the objects in the software
system, expressed using the constructs discussed in Section 3.1.6, the communication
intensity information extracted from the requirement analysis, and the number
of replications for each object as required for such purposes as fault tolerance. The
output of our algorithm is an undirected weighted graph in which every node represents
a cluster of objects and every edge between two nodes has a positive weight
which represents the degree of contribution that can be made to the enhancement
of the overall performance by parallel execution of the two clusters represented by
the two nodes. Our partitioning approach consists of three parts: initialization,
normalization and clustering.


4.1.1  Initialization 

We begin with an undirected weighted graph, in which each node represents an object,
and there is an edge between the nodes of two objects if and only if either one of the
two objects invokes the other or both objects are invoked concurrently by another
object. Every node has a non-negative weight that is equal to the number of
replications of the object represented by that node. Every edge in the graph has two
kinds of weights: one for communication and another for concurrency.

The communication weight associated with an edge represents the communication cost
incurred if the objects of the two nodes incident to the edge communicate with one
another and are not allocated on the same processor. If object $O_i$ invokes another
object $O_j$ (i.e. there is a communication overhead between the two objects), then a
non-zero communication weight $u_{ij}$ is assigned to the edge connecting the nodes of
those objects. The magnitude of $u_{ij}$ is equal to the product of the object invocation
frequency, $f_{ij}$, and the number of data units transferred between the two objects every
time one of them invokes the other (i.e. the two objects communicate). We assume that
the information needed to compute the communication weight can be obtained from
the analysis of the requirement specification. A negative sign is given to communication
weights to imply cost. As a result, a smaller (more negative) communication weight
implies more communication overhead.

The concurrency weight associated with an edge corresponds to the gain in the
overall performance of the system that can be obtained by parallel execution of
the two objects involved. If two objects are invoked concurrently by another object,
there is a potential parallelism between the two objects, and a non-zero concurrency
weight equal to the frequency with which the two objects are invoked is assigned to the
edge connecting their nodes. A positive sign is given to concurrency weights to imply
the gain achieved as a result of potential parallel execution of the objects involved.

The undirected weighted graph of the software system $G = (V, E)$ has a set of
nodes $V$ and a set of edges $E$ such that:

• Object $O_i$ is represented by node $O_i$ in $V$.

• Edge $(O_i, O_j)$ is in $E$ if and only if $O_i$ and $O_j$ can communicate with one another
or both objects can be invoked concurrently by another object.

• The node of an object, say $a$, has a non-negative weight, denoted by $r_a$, which is
equal to the number of replications of the object modeled by that node.

• An ordered pair of weights $(u_{ij}, v_{ij})$ is associated with edge $(O_i, O_j)$, where $u_{ij}$ and
$v_{ij}$ are the communication and concurrency weights, respectively.

Let $f_{ij}$ be the frequency with which $O_i$ invokes $O_j$ and $d_{ij}$ be the number of data units
transferred between $O_i$ and $O_j$ every time $O_i$ invokes object $O_j$. Communication and
concurrency weights are assigned according to the following five rules. Rules 1-4
apply to the cases where objects are related by only one construct.

Rule 1. $O_1$ : CON($O_2, O_3, \ldots, O_n$) describes a case where objects $O_2, O_3, \ldots, O_{n-1}$,
and $O_n$ are executed concurrently after being invoked by object $O_1$. It corresponds
to a subgraph $G = (V, E)$ where $V = \{O_1, O_2, \ldots, O_n\}$, and $E = \{(O_i, O_j), 1 \le i < j \le n\}$.
Communication and concurrency weights are assigned to the edges in $E$ as
follows:

1) For $2 \le i < j \le n$, there are two possibilities:

a) If $(O_i, O_j)$ is new, then $u_{ij} = 0$ and $v_{ij} = f_{1i} = f_{1j}$.

b) If $(O_i, O_j)$ is old, then $u_{ij}$ remains unchanged, and $v_{ij} = v_{ij} + f_{1i}$.

2) For $i = 1$ and $2 \le j \le n$, there are two possibilities:

a) If $(O_i, O_j)$ is new, then $u_{ij} = -(f_{ij} \times d_{ij})$ and $v_{ij} = 0$.

b) If $(O_i, O_j)$ is old, then $u_{ij} = u_{ij} - (f_{ij} \times d_{ij})$ and $v_{ij}$ remains unchanged.

Rule 2. $O_1$ : SEQ($O_2, O_3, \ldots, O_n$) describes a case where object $O_1$ invokes objects
$O_2, O_3, \ldots, O_{n-1}$, and $O_n$ in a sequential order. It corresponds to a subgraph $G =
(V, E)$, where $V = \{O_1, O_2, \ldots, O_n\}$, and $E = \{(O_1, O_j), 2 \le j \le n\}$. In assigning
communication and concurrency weights to the edges in $E$, there are two possible
cases for $i = 1$ and $2 \le j \le n$:

1) If $(O_i, O_j)$ is new, then $u_{ij} = -(f_{ij} \times d_{ij})$ and $v_{ij} = 0$.

2) If $(O_i, O_j)$ is old, then $u_{ij} = u_{ij} - (f_{ij} \times d_{ij})$ and $v_{ij}$ remains unchanged.

Rule 3. $O_1$ : ONE-OF($O_2, O_3, \ldots, O_n$) and $O_1$ : SEL($O_2, O_3, \ldots, O_n$) each
describe a case where object $O_1$ invokes only one of the objects $O_j$ for $2 \le j \le n$.
The corresponding subgraph is $G = (V, E)$, where $V = \{O_1, O_2, \ldots, O_n\}$, and $E =
\{(O_1, O_j), 2 \le j \le n\}$. In assigning communication and concurrency weights to the
edges in $E$, there are two possible cases for $i = 1$ and $2 \le j \le n$:

1) If $(O_i, O_j)$ is new, then $u_{ij} = -(f_{ij} \times d_{ij})/(n - 1)$ and $v_{ij} = 0$.

2) If $(O_i, O_j)$ is old, then $u_{ij} = u_{ij} - (f_{ij} \times d_{ij})/(n - 1)$ and $v_{ij}$ remains unchanged.

As mentioned in Section 3.1.6, this construct is used to represent the synchronized access
of the shared data.

Rule 4. $O_i$ : WAIT($O_j$) describes a case where object $O_i$ waits to be invoked
by object $O_j$. It corresponds to a subgraph $G = (V, E)$, where $V = \{O_i, O_j\}$, and
$E = \{(O_i, O_j)\}$. There are two possibilities:

1) If $(O_i, O_j)$ is new, then $u_{ij} = v_{ij} = 0$.^1

2) If $(O_i, O_j)$ is old, then both $u_{ij}$ and $v_{ij}$ remain unchanged.

^1 A nonzero communication weight will be assigned to this edge when object $O_j$ is processed.

Rule 5 is applied to nested clauses. Before presenting Rule 5, we define the
preservation of the edge relationship, denoted by E-R, between two subgraphs. Let $G_A$ and
$G_B$ be two subgraphs defined as $G_A = (V_A, E_A)$ and $G_B = (V_B, E_B)$, where $V_A =
\{x_1, \ldots, x_s\}$ and $V_B = \{y_1, \ldots, y_r\}$ for some $s \ge 1$ and $r \ge 1$. For every $x$ in $V_A$ and
every $y$ in $V_B$, one of the following relations holds:

E-R(x, y) = True if there is an edge incident to x and y.

E-R(x, y) = False otherwise.

Then the preservation of the edge relationship E-R between the two subgraphs is
defined as follows:

$$
\begin{aligned}
\text{E-R}(G_A, G_B) &= \text{E-R}(x_1, y_1) = \text{E-R}(x_2, y_1) = \cdots = \text{E-R}(x_s, y_1) \\
&= \text{E-R}(x_1, y_2) = \text{E-R}(x_2, y_2) = \cdots \\
&= \text{E-R}(x_s, y_r).
\end{aligned}
$$
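The E-R definition can be checked mechanically. The following is a minimal Python sketch, assuming the graph is stored as a set of unordered node pairs; the function names `er` and `er_preserved` are illustrative and do not come from the original text.

```python
from itertools import product

def er(edges, x, y):
    """E-R(x, y): True if there is an edge incident to both x and y."""
    return frozenset((x, y)) in edges

def er_preserved(edges, va, vb):
    """Return the common value of E-R(x, y) over all x in V_A and y in V_B,
    or None when the pairs disagree (the relationship is not preserved)."""
    values = {er(edges, x, y) for x, y in product(va, vb)}
    return values.pop() if len(values) == 1 else None

# Hypothetical example: every node of {a, b} is adjacent to every node of {c, d}.
edges = {frozenset(p) for p in [("a", "c"), ("a", "d"), ("b", "c"), ("b", "d")]}
print(er_preserved(edges, ["a", "b"], ["c", "d"]))

# Removing one edge makes the pairwise values disagree.
edges.discard(frozenset(("b", "d")))
print(er_preserved(edges, ["a", "b"], ["c", "d"]))
```

Returning `None` for the mixed case is a design choice of this sketch; the text itself only defines E-R(G_A, G_B) when all pairwise values coincide.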

Rule  5.  It  is  applied  when  nested  clauses  are  used  to  specify  the  object  behavior. 
The  steps  are: 

1)  Modify  the  object  behavior  by  substituting  all  the  nested  clauses  with  dummy 
objects. 

2)  Select  and  apply  a  rule  based  on  the  construct  used.  For  every  dummy  object 
introduced  in  Step  1),  do  the  following: 

2.1)  Apply  an  appropriate  rule  and  preserve  the  edge  relationships  with  other 
objects. 

2.2)  Assign  communication  and  concurrency  weights  using  Rules  1-4. 
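
The weight assignments of Rules 1-4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the graph is a dictionary from unordered node pairs to a `[u, v]` pair, a missing edge counts as "new", the helper names are hypothetical, and the nested clauses of Rule 5 are not handled. The frequencies and data-unit counts in the example are made up.

```python
def edge(graph, a, b):
    """Return the [u, v] weight pair of edge (a, b); a new edge starts at [0, 0]."""
    return graph.setdefault(frozenset((a, b)), [0, 0])

def invoke(graph, caller, callee, f, d, share=1):
    """u = u - (f x d) / share; the concurrency weight v is left unchanged."""
    edge(graph, caller, callee)[0] -= f[caller, callee] * d[caller, callee] / share

def rule_con(graph, o1, invoked, f, d):
    """Rule 1, O1 : CON(O2, ..., On)."""
    for k, oi in enumerate(invoked):
        for oj in invoked[k + 1:]:
            edge(graph, oi, oj)[1] += f[o1, oi]   # part 1): v = v + f_1i
    for oj in invoked:
        invoke(graph, o1, oj, f, d)               # part 2)

def rule_seq(graph, o1, invoked, f, d):
    """Rule 2, O1 : SEQ(O2, ..., On)."""
    for oj in invoked:
        invoke(graph, o1, oj, f, d)

def rule_one_of(graph, o1, invoked, f, d):
    """Rule 3, ONE-OF or SEL: the cost is divided by n - 1 = len(invoked)."""
    for oj in invoked:
        invoke(graph, o1, oj, f, d, share=len(invoked))

def rule_wait(graph, oi, oj):
    """Rule 4, Oi : WAIT(Oj): create the edge with u = v = 0 if it is new."""
    edge(graph, oi, oj)

# Hypothetical frequencies f and data units d, keyed by (caller, callee).
f = {("O1", "O2"): 5, ("O1", "O3"): 5, ("O2", "O3"): 3}
d = {("O1", "O2"): 2, ("O1", "O3"): 4, ("O2", "O3"): 1}
graph = {}
rule_con(graph, "O1", ["O2", "O3"], f, d)   # O1 : CON(O2, O3)
rule_seq(graph, "O2", ["O3"], f, d)         # O2 : SEQ(O3)
for e, (u, v) in graph.items():
    print(sorted(e), u, v)
```

In this example the edge between O2 and O3 ends up with both a concurrency weight (from the CON clause) and a communication weight (from the SEQ clause), which is the situation the normalization step is designed to resolve.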

4.1.2  Normalization 

As stated earlier, the goal of our partitioning approach is to reduce the communication
cost between the processors and to exploit potential parallelism among the objects.
Since these two subgoals are conflicting, it is desirable to find an optimal point at which
communication costs are reasonably reduced while the parallel execution of objects
is well achieved. However, when no precise information on the execution time and
communication time is available, an optimal solution cannot be found. Even if such
information were available, the clustering problem to be discussed in the next
section would remain NP-hard [50].

In order to accommodate the two conflicting partitioning subgoals, we present
a normalization method so that the communication and concurrency weights
associated with every edge can be combined to obtain a common metric for the two
kinds of weights. Let $u_{min}$ and $v_{max}$ be the minimum communication weight and
the maximum concurrency weight, respectively. First, we replace every communication
weight $u$ with $(1/u_{min}) \times u$ and every concurrency weight $v$ with $(1/v_{max}) \times v$.
This brings the two types of weights to the same scale. Then, we define for every edge
a new weight $W$, called gain, to be $-\alpha \times u + (1 - \alpha) \times v$, where the coefficient
$\alpha$ lies in the range $(0, 1)$ and is determined by the characteristics of the underlying
parallel processing system. To obtain an optimal $\alpha$, the configuration of the parallel
processing system, the CPU speed and the communication unit capability are needed,
which may not be available at the partitioning stage. Therefore, by allowing
modification of $\alpha$, the performance of the software system can be adjusted.
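
The normalization step can be sketched as follows. This is a minimal Python illustration, assuming each edge carries a pair (u, v) with u ≤ 0 and v ≥ 0; the function name `normalize`, the example weights and the guard against division by zero are assumptions of this sketch, not part of the original method.

```python
def normalize(weights, alpha):
    """Map every edge to its gain W = -alpha * u' + (1 - alpha) * v',
    where u' = u / u_min and v' = v / v_max are the rescaled weights."""
    u_min = min(u for u, _ in weights.values())   # most negative communication weight
    v_max = max(v for _, v in weights.values())   # largest concurrency weight
    gains = {}
    for e, (u, v) in weights.items():
        u_scaled = u / u_min if u_min != 0 else 0.0   # lies in [0, 1]
        v_scaled = v / v_max if v_max != 0 else 0.0   # lies in [0, 1]
        gains[e] = -alpha * u_scaled + (1 - alpha) * v_scaled
    return gains

# Hypothetical (u, v) pairs for three edges.
weights = {("O1", "O2"): (-10, 0), ("O1", "O3"): (-20, 0), ("O2", "O3"): (-3, 5)}
for e, w in normalize(weights, alpha=0.5).items():
    print(e, round(w, 3))
```

Because $u$ and $u_{min}$ are both negative, the rescaled communication weight is non-negative, and the leading minus sign in the gain restores its role as a cost. A larger $\alpha$ therefore favors clustering, and a smaller $\alpha$ favors parallel execution.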

We now have a graph in which every node represents an object (the same as in the
initial graph) and every edge has only one weight, denoted by $W$, that represents the
degree of contribution to the overall performance of the system that can be made by
the parallel execution of the objects of the nodes incident to that edge. Next, we
define the function gain for the graph to be the sum of the weights of all edges in the
graph, excluding any weight with value $+\infty$.


4.1.3  Clustering 

In this section, we will show how to cluster the objects represented by the nodes in
the graph in order to increase the total gain for the graph by taking a bottom-up
approach.

Note that an edge with a positive weight suggests that parallel execution of the two
objects involved will reduce the completion time of the software system. Hence, these
two objects should not be in one cluster. On the other hand, an edge with a negative
weight implies that the two objects involved should be placed in one cluster because
parallel execution of those objects does not reduce the completion time of the system
due to the communication overhead occurring if they are not executed on the same
processor. If the weight is equal to zero, we choose not to cluster the objects involved
for the following reasons. Clustering such objects cannot contribute to increasing
the gain for the graph. Moreover, comparing a partition consisting of many small
processes with one consisting of a few large processes, the partition with many small
processes will provide the allocation phase with more flexibility for the purpose of
load balancing or growth potential [42].

The input is an undirected weighted graph $G' = (V', E')$, where $V' = \{O_1, O_2, \ldots, O_p\}$
and $O_i$ is an object for $1 \le i \le p$. Every node has a non-negative weight which is
equal to the number of replications of the object represented by that node. Every
edge $(O_i, O_j) \in E'$ has a weight $W_{ij}$. We define the function SIZE to map every node in
the graph to a positive integer that is equal to the number of objects in the cluster
represented by that node. The value of the function SIZE at any node in the initial graph
is defined as one because every node in the initial graph represents only one object.

The steps we take in clustering the nodes of the graph are as follows:

Step 1. for every node c do set SIZE(c) = 1.
        while there is an edge with a negative weight and
              there is more than one node in the graph do
        begin
Step 2.   Find the edge in the graph with minimum weight.
          Let it be edge (a, b) with a weight $W_{ab}$.
Step 3.   Group a and b into a new cluster q.
          Set SIZE(q) = SIZE(a) + SIZE(b).
Step 4.   for every node c such that
          E-R(c, a) = true or E-R(c, b) = true,
          there are four possible cases:
          Case 1. if E-R(c, a) = true and E-R(c, b) = true, then
                  set E-R(c, q) = true.
                  assign to edge (c, q) a weight equal to ($W_{ca}$ + $W_{cb}$).
          Case 2. if E-R(c, a) = true and E-R(c, b) = false, then
                  set E-R(c, q) = true.
                  assign to edge (c, q) a weight equal to $W_{ca}$.
          Case 3. if E-R(c, a) = false and E-R(c, b) = true, then
                  set E-R(c, q) = true.
                  assign to edge (c, q) a weight equal to $W_{cb}$.
          Case 4. if E-R(c, a) = false and E-R(c, b) = false, then
                  set E-R(c, q) = false.
Step 5.   if ( SIZE(a) = 1 and $r_a$ > 1 ) then
          begin
Step 5.1    Add edge (a, q) to the graph.
Step 5.2    Assign $+\infty$ to edge (a, q).
Step 5.3    Set $r_a$ = $r_a$ - 1.
          end
          else
          begin
Step 5.4    Delete node a.
Step 5.5    for every node c such that E-R(c, a) = true, do
            assign E-R(c, a) = false.
          end
Step 6.   Repeat Step 5 for node b.
        end
Step 7. for every node c such that ( SIZE(c) = 1 and $r_c$ > 1 ) do
        begin
Step 7.1  Let $r_c$ = k. Add k new nodes, called c, to the graph and
          for any one of these nodes duplicate the edges incident to
          the node c that has initially been in the graph.
Step 7.2  Assign a $+\infty$ weight to the edge connecting any two of
          these new nodes to one another or connecting any one of
          them to the node c that has initially been in the graph.
        end


In Step 1, the value of function SIZE at every node in the graph is set to one. Steps
2-6 are executed until there is no edge with a negative weight or there is only one node
in the graph. In Step 2, the edge with the minimum weight is chosen as a candidate
for reducing communication cost. In Step 3, the two nodes of the edge chosen in Step
2 are clustered into a new node and the size of the new node is computed. In Step
4, the weights affected by the addition of the new node are modified. In Steps 5 and
6, the clustered nodes are deleted. When a node is a cluster or is not replicated,
the node along with all the edges associated with it is deleted. When a node
has replicates, we replace the node with its replicate and assign the largest possible
weight, $+\infty$, to the edge connecting the new cluster and the node representing its
replicate so that no object can be grouped with its replicate into the same cluster. In
Step 7, the nodes that each represent only one object, where the object has some
replications to be considered, are identified. If it turns out that $k$ replications of the
object are to be considered, then $k$ new nodes of the object are created, and each edge
incident to any one of these new nodes has the same weight as that of the identical
edge incident to the old node of that object. The edges that connect either any two
of these new nodes to one another or any one of these new nodes to the old node of
the same object are assigned a weight of $+\infty$. This measure is taken to ensure
that no object will be placed with its replication in one cluster.

The  output  is  an  undirected  weighted  graph  in  which  every  node  models  a  cluster 
of  object(s)  and  every  edge  has  a  positive  weight  which  represents  the  degree  of 
contribution  to  the  overall  performance  of  the  system  that  can  be  made  by  the  parallel 
execution  of  the  two  clusters  represented  by  the  two  involving  nodes.  Note  that  a 
larger  weight  implies  more  gain  can  be  obtained  as  a  result  of  allocating  the  involving 
clusters  on  two  different  processors. 
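
The core loop of the clustering algorithm (Steps 1-4 together with the deletion branch of Steps 5-6) can be sketched in Python as below. The sketch deliberately assumes that no object is replicated, so Steps 5.1-5.3 and Step 7 are omitted; the function name `cluster` and the example gains are hypothetical.

```python
def cluster(objects, gains):
    """Greedy bottom-up clustering without replication.
    objects: iterable of object names.
    gains: dict mapping a pair (a, b) of object names to the edge weight W.
    Returns (nodes, weights); every node is a frozenset of clustered objects,
    so SIZE(node) is simply len(node)."""
    node = {o: frozenset([o]) for o in objects}            # Step 1: SIZE = 1
    w = {frozenset((node[a], node[b])): g for (a, b), g in gains.items()}
    nodes = set(node.values())
    while len(nodes) > 1 and w and min(w.values()) < 0:
        a, b = min(w, key=w.get)                           # Step 2: minimum-weight edge
        q = a | b                                          # Step 3: new cluster q
        merged = {}
        for e, g in w.items():                             # Step 4: redirect edges to q
            if e == frozenset((a, b)):
                continue                                   # the merged edge disappears
            if a in e or b in e:
                (c,) = e - {a, b}
                key = frozenset((c, q))
                merged[key] = merged.get(key, 0) + g       # Case 1 sums, Cases 2-3 copy
            else:
                merged[e] = g
        w = merged
        nodes -= {a, b}                                    # Steps 5.4-5.5 and 6
        nodes.add(q)
    return nodes, w

# Hypothetical gains for three objects.
gains = {("O1", "O2"): -0.25, ("O1", "O3"): -0.5, ("O2", "O3"): 0.425}
nodes, weights = cluster(["O1", "O2", "O3"], gains)
print(sorted(sorted(n) for n in nodes), weights)
```

In this example O1 and O3 are merged first because their edge has the minimum weight; the two edges from O2 are then combined into a single edge of positive weight, so the loop stops with two clusters.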


4.2  Time  Complexity 


The time complexity of the clustering algorithm described above is a function of $m$
and $e'$, where $m$ is the number of objects in the graph including replicated objects, and
$e'$ is the total number of edges when replicated objects are also included in the graph.
Step 1 takes $O(n)$ time. Step 2 runs in $O(e')$ time. Step 3 has a constant running time.
Each pass of Step 4 can run in $O(n)$ time. Because there are at most $n$ objects involved,
the time complexity of Step 5 is $O(n \times \log(e'))$. Step 6 is simply the repetition of Step
5. The while loop will be executed at most $e'$ times. This makes the time complexity
of the loop $O(\min(e', m) \times \max(e', n))$, which is equivalent to $O(e' \times m)$. Step 7
can also run in $O(e' \times m)$ time in the worst case. Therefore, the time complexity of the
clustering algorithm is $O(e' \times m)$ in the worst case. Clearly, if no replicated objects
are considered, the worst-case time complexity of the clustering algorithm reduces to
$O(e \times n)$, where $n$ and $e$ are the numbers of objects and edges without replication.

Chapter 5


PROOF/L  Front-end  Translation 


The PROOF/L front-end translator is the first stage in the process of transformation
from PROOF/L to a target language. One of the major tasks performed by the
front-end translator is to make the implicit parallelism present in PROOF/L code explicit.
In this chapter, we will describe the translation rules which form the basis for the
front-end translator. These rules cover a number of different functional forms and some
special node insertions, such as the copy node. The UNIX tools lex and yacc are used
for the implementation of the front-end translator. An example program, Bounded
Buffer, is used to illustrate the translation from PROOF/L to the IPR. We will also
explain different kinds of detections and reconstructions for control constructs, such
as if and while, and the method for generating the resultant IPR.


5.1  Syntax  Rules  of  PROOF/L 


In this section, we will present the concrete syntax rules of PROOF/L. The syntax
rules include the functional forms in PROOF such as α (apply to all), β (distributed
apply), γ (filter), while (loop), if (conditional) and R (pseudo function). The syntax
also includes object-oriented features such as inheritance and body of active as well
as pseudo active objects. Complete concrete syntax rules are listed as follows:

proofl           →  program name : class_list obj_list body_list end

body_list        →  body_def // body_list
                 |  body_def

body_def         →  body of object name : func

obj_list         →  obj_list obj_def
                 |  obj_def

obj_def          →  active_opt c_object name_list : instance of name ins_opt

ins_opt          →  ε
                 |  ( name_list )

active_opt       →  ε
                 |  active
                 |  pseudo active

class_list       →  class_list class_def
                 |  class_def

class_def        →  class name class_ins local_data_list super_class method_def end class

class_ins        →  ε
                 |  ( dcln_list )

super_class      →  ε
                 |  superclass : name ( dcln_list ) inherit : inherit_opt

inherit_opt      →  ε
                 |  all
                 |  name_list

name_list        →  name , name_list
                 |  name

local_data_list  →  composition local_data

local_data       →  dcln × local_data
                 |  dcln

method_def       →  method method_def
                 |  method

method           →  method name ( method_io ) guard_dcln expression func

guard_dcln       →  ε
                 |  guard ( bool_exp )

method_io        →  ε
                 |  input_list -> output_list

input_list       →  dcln_list

dcln_list        →  dcln , dcln_list
                 |  dcln

dcln             →  name : data_type
                 |  name : class_name
                 |  name : list_opt ( data_type )
                 |  name : list_opt ( name )
                 |  name
                 |  data_type

data_type        →  int
                 |  boolean

output           →  dcln

class_name       →  name

list_opt         →  list * list_opt
                 |  list

output_list      →  output , output_list
                 |  ε
                 |  output

inst_list        →  inst , inst_list
                 |  inst

inst             →  name = func

func             →  λ ( name ) func                        (lambda abstraction)
                 |  ; ( func_list )                        (sequential)
                 |  let name = func in func
                 |  object name ( inst_list )              (object instantiation)
                 |  α name [ func_list ]                   (alpha function form)
                 |  β [ func_list ] [ func_list ]          (beta function form)
                 |  γ [ bool_exp_list ] [ func_list ]      (gamma function form)
                 |  stmt

stmt             →  while ( bool_exp , func ) func         (while loop)
                 |  if ( bool_exp , func , func ) func     (conditional)
                 |  R [| name |] func                      (pseudo function R)
                 |  exp

exp              →  aexp
                 |  exp aexp
                 |  binop aexp aexp

bool_exp         →  bool_op aexp aexp
                 |  not bool_exp
                 |  true
                 |  false
                 |  λ ( name ) bool_exp
                 |  ( bool_exp )

aexp             →  name
                 |  integer
                 |  NIL
                 |  string
                 |  ( func_list )
                 |  [ func_list ]                          (list)

bool_op          →  =                                      (Boolean operators)
                 |  <>
                 |  <
                 |  <=
                 |  >
                 |  >=

func_list        →  func , func_list
                 |  func

binop            →  +                                      (binary operators)
                 |  *
                 |  /
                 |  mod

5.2  Intermediate  Program  Representation 

IPR is a directed graph $G = (V, E)$ in which $V$ is a set of nodes and $E$ is a set of
directed edges. A node represents a computation or data object^1. An edge $(u, v)$
represents dataflow from node $u$ to node $v$. The nodes in $V$ can be divided into three
types: computation nodes, control construct nodes and list handling nodes.

A computation node represents a function receiving input value(s) and generating
output value(s). These functions are free from side effects, that is, they always produce
the same result when the same input values are given.

Computation nodes include basic numeric and boolean operators, constant, id and
copy nodes.

• Numeric and boolean operators include operators such as *, +, -, =, <, >.

• A constant node represents a constant generator which produces the same
specified value. There is no input to this node.

• An id node represents an identity function which always returns the same value
as its input value.

• A copy node represents a duplicator, which receives an input and produces an
appropriate number of copies having the same value as its input.

Control construct nodes are needed to specify the control flow among functions.
Although the data flow dependency relationship is the dominating factor in dictating
the execution flow of the program, in order to represent control functions, such as if
and while, we need the select, distributor and merge nodes.

• A select node represents a conditional construction function. It receives input
data $i_1, i_2, \ldots, i_n$ and control data $c$ and returns one input $i_j$ as its output according
to the value of control data $c$.

• A distributor node represents a conditional construction function. It receives
input data $i$ and control data $c$ and passes $i$ to one of the output ports $o_1, o_2, \ldots, o_n$
according to the value of $c$.

• A merge node represents a nondeterministic selector, which receives an arbitrary
number of input data at a time and returns the one which arrives first. If more
than one input arrives at the same time, they are chosen in an arbitrary order.

^1 Data object can also be considered as a special case of computation.

Figure 5.1: The IPR forms of select, distributor and merge node.

There are two kinds of list handling nodes: construct node and split node.

• A construct node receives one or more input values and assembles them into a list.

• A split node receives a list as an input and splits that list into its element values.
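
To make the node types concrete, the following is a small Python sketch of single-output IPR nodes evaluated by pulling values from their inputs. It is an illustration only: the class `Node` and the helper names are hypothetical, the select node is assumed to use a 0-based control value, and the multi-output or nondeterministic nodes (copy, distributor, split, merge) are left out.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Node:
    kind: str                                  # e.g. "constant", "+", "select"
    func: Callable                             # side-effect-free function of the inputs
    inputs: List["Node"] = field(default_factory=list)

    def value(self):
        """Evaluate the node by first evaluating the nodes that feed it."""
        return self.func(*(n.value() for n in self.inputs))

def constant(c):
    return Node("constant", lambda: c)         # no input, always the same value

def ident(x):
    return Node("id", lambda v: v, [x])        # returns its input unchanged

def compute(kind, func, *inputs):
    return Node(kind, func, list(inputs))      # numeric or boolean operator

def select(control, *inputs):
    # Returns the input chosen by the control value (0-based: an assumption).
    return Node("select", lambda c, *vals: vals[c], [control, *inputs])

def construct(*inputs):
    return Node("construct", lambda *vals: list(vals), list(inputs))

# (2 + 3) * 4 as a dataflow graph.
expr = compute("*", lambda a, b: a * b,
               compute("+", lambda a, b: a + b, constant(2), constant(3)),
               constant(4))
print(expr.value())
print(select(constant(1), constant("a"), constant("b")).value())
print(construct(constant(1), ident(constant(2))).value())
```

This pull-based evaluation is chosen for brevity; an actual dataflow execution would fire a node when its inputs arrive, which is what allows independent nodes to run in parallel.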

5.3 Translation Rules: From PROOF/L to IPR

In this section we will present the translation rules from PROOF/L to the IPR.
Different syntax rules in Section 5.1 are translated to different kinds of IPR form
depending on their semantics. For example, the semantic meaning of the α functional
form is "apply to all". Therefore, we need to translate "apply to all" into a corresponding
IPR that represents its semantics. In some cases, special nodes need to be inserted,
such as latch, split and copy. All these translation rules are presented in this section.
For each translation rule, we will first present the functional form in PROOF/L and
then the corresponding IPR form.

1. Function Application - apply function func to $e_1, e_2, \ldots, e_n$
func $e_1$ $e_2$ ... $e_n$ where $n \ge 1$

Figure 5.2: The corresponding IPR for function application.

2. List - a list consisting of $e_1, e_2, \ldots, e_n$
[ $e_1, e_2, \ldots, e_n$ ] where $n \ge 1$

Figure 5.3: The corresponding IPR for list.

3. α Functional Form - apply to all
α func [$e_1, e_2, \ldots, e_n$]

Figure 5.4: The corresponding IPR for α functional form.

4. β Functional Form - distributed apply
β [$f_1, f_2, \ldots, f_n$] [$e_1, e_2, \ldots, e_n$]

Figure 5.5: The corresponding IPR for β functional form.

5. γ Functional Form - filter
γ [$b_1, b_2, \ldots, b_n$] [$e_1, e_2, \ldots, e_n$]

Figure 5.6: The corresponding IPR for γ functional form.


6. Lambda Abstraction
λ x.exp

Figure 5.7: The corresponding IPR for lambda abstraction.

The dashed box will be replaced by the actual definition of exp.

7. While Functional Form - while b is true, continue applying e to x
while(b, e) x

Figure 5.8: The corresponding IPR for while functional form.

The dashed boxes will be replaced by the actual definitions of b and e.

8. If Functional Form - if b is true, then apply $e_{then}$ to e; otherwise, apply $e_{else}$ to e.
if(b, $e_{then}$, $e_{else}$) e

Figure 5.9: The corresponding IPR for if functional form.

The dashed boxes will be replaced by the actual definitions of b, $e_{then}$ and $e_{else}$.
9. Split Node Insertion

class obj_class
    composition c1 x c2 x ... x cn
    method foo obj : obj_class
    ...
end class

Although objects can be parameters of a method, the operations of a method of
an object mostly apply to the local data of the object. Hence, in this case we
first split the object into its local data. The IPR form is:

Figure 5.10: The corresponding IPR for split node insertion.

10. Copy Node Insertion

Several different functions may use the same input, as shown in the following figure.

Figure 5.11: Functions f1, f2, f3 have the same input x.

In this case, a copy node is inserted into the IPR form.

Figure 5.12: The resulting IPR after a copy node is inserted.


11. Sequential Functional Form - executes f1, f2, ... , fn sequentially.
;( f1, f2, ... , fn ) where n > 1

Figure 5.13: The corresponding IPR for sequential functional form.

The latch node is a special node used only for control flow. This node has two
inputs, control and data-in, and one output, data-out. The data from data-in cannot
pass through the node until the control line is fired.

Figure 5.14: The input/output of a latch node.
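
The meaning of several of the functional forms above can be illustrated with a small Python sketch. This is not PROOF/L and not part of the translator; the function names are hypothetical, and two points are assumptions: that the β form pairs the i-th function with the i-th element (as the bounded buffer example in Section 5.4 suggests), and that the predicate b of the while and if forms is evaluated on the argument.

```python
from typing import Any, Callable, List, Sequence


def apply_to_all(func: Callable[[Any], Any], items: Sequence[Any]) -> List[Any]:
    """alpha func [e1, ..., en]: apply one function to every element."""
    return [func(e) for e in items]


def distributed_apply(funcs: Sequence[Callable[[Any], Any]],
                      items: Sequence[Any]) -> List[Any]:
    """beta [f1, ..., fn] [e1, ..., en]: apply the i-th function to the i-th element
    (position-wise pairing is an assumption of this sketch)."""
    if len(funcs) != len(items):
        raise ValueError("beta form needs as many functions as elements")
    return [f(e) for f, e in zip(funcs, items)]


def while_form(b: Callable[[Any], bool], e: Callable[[Any], Any], x: Any) -> Any:
    """while(b, e) x: keep applying e to x as long as b holds."""
    while b(x):
        x = e(x)
    return x


def if_form(b: Callable[[Any], bool], e_then: Callable[[Any], Any],
            e_else: Callable[[Any], Any], x: Any) -> Any:
    """if(b, e_then, e_else) x: choose a branch and apply it to the argument."""
    return e_then(x) if b(x) else e_else(x)


# The body of method get, [ beta[tail, dec] [store, count], head(store) ],
# evaluated on a sample buffer.
store, count = [1, 2, 3], 3
get_result = [
    distributed_apply([lambda s: s[1:], lambda c: c - 1], [store, count]),
    store[0],
]
```

Because every element of an α or β form is computed independently, each application can become a separate IPR node, which is where the parallelism of these forms comes from.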


5.4  Textual  Form  of  IPR 

IPR is a graphical form, and a typical node in IPR is shown in Figure 5.15. We need
a textual form in order to save IPR in files. In this section, we introduce the
textual form of IPR.

Figure 5.15: A typical IPR node.

where n is the number of this node, f is the function of this node (such as constructor,
selector, ...), i_k, k = 1, 2, ..., n, are the input nodes, and o_k, k = 1, 2, ..., n,
are the output nodes.

This node can also be represented in the following tabular form:

Node   Function   Inputs               Outputs
n      f          i_1, i_2, ..., i_n   o_1, o_2, ..., o_n

Now we present a bounded buffer as an example to illustrate our translation
process. First, we present the PROOF/L source code for the bounded buffer and then
show how to apply the translation rules introduced in Section 5.3. Finally, the
resultant IPR output is presented.


The  PROOF/L  code  of  the  example  Class  Bounded_buffer  is  given  as  follows: 


class Bounded_Buffer(itemtype, size)
    composition store:list(itemtype) x count:int
    method put buf x
        guard(< buf.count size)
        expression
            β[(append-right x), inc] [buf.store, buf.count]
    method get buf
        guard(> buf.count 0)
        expression
            [ β[tail, dec] [buf.store, buf.count], head(buf.store) ]
end class

This  is  a  complete  definition  of  a  class  in  PROOF/L.  Now,  we  would  like  to 
present  the  corresponding  IPR  of  each  method. 

•  Method  get 

method get buf
    guard(> buf.count 0)
    expression
        [ β[tail, dec] [buf.store, buf.count], head(buf.store) ]


The  parameter  is  an  object,  and  according  to  the  translation  rule  (9)  of 
Section  5.3  a  split  node  is  inserted  to  split  the  object  into  its  local  data, 
store  and  count.  The  output  of  method  get  is  a  list,  and  hence  rule  (2)  is 
applied. The first element of this list is a β functional form and hence rule
(4)  is  applied.  The  second  element  is  an  ordinary  function  application  and 
rule  (1)  is  used.  We  can  see  both  tail  and  head  use  the  buf.store  as  input. 
According  to  rule  (10),  a  copy  node  is  inserted.  The  resultant  IPR  form  of 
method  get  is  in  Figure  5.16.  The  corresponding  textual  form  is  given  as 
follows: 


Node   Function    Inputs   Outputs
1      CONSTRUCT   2,6      OUTPUT
2      CONSTRUCT   3,4      1
3      tail        5        2
4      dec         7        2
5      COPY        7        3,7
6      SPLIT       8        5,4
7      head        5        1
8      buf         INPUT    8

Figure 5.16: IPR for method get.
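
As an illustration of how the tabular textual form might be produced, the following Python sketch keeps each IPR node as a record and writes one row per node. The record layout, the names IPRNode and to_textual_form, and the three-node example graph (a function application func e1 e2) are assumptions made for this sketch; they are not taken from the translator itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IPRNode:
    number: int          # unique node number
    function: str        # e.g. CONSTRUCT, SPLIT, COPY or a function name
    inputs: List[str]    # numbers of the input nodes, or "INPUT"
    outputs: List[str]   # numbers of the output nodes, or "OUTPUT"


def to_textual_form(nodes: List[IPRNode]) -> str:
    """Render the nodes as a Node / Function / Inputs / Outputs table."""
    rows = [("Node", "Function", "Inputs", "Outputs")]
    for node in sorted(nodes, key=lambda n: n.number):
        rows.append((str(node.number), node.function,
                     ",".join(node.inputs), ",".join(node.outputs)))
    # Pad every column to the width of its longest cell.
    widths = [max(len(row[col]) for row in rows) for col in range(4)]
    lines = ["  ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
             for row in rows]
    return "\n".join(lines)


# Hypothetical IPR for the function application "func e1 e2".
example = [
    IPRNode(1, "func", ["2", "3"], ["OUTPUT"]),
    IPRNode(2, "e1", ["INPUT"], ["1"]),
    IPRNode(3, "e2", ["INPUT"], ["1"]),
]
table = to_textual_form(example)
```

Calling print(table) writes the rows in the same column order as the tables in this section.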


•  Method  put 

method put buf x
    guard(< buf.count size)
    expression
        β[(append-right x), inc] [buf.store, buf.count]

There  are  two  parameters  of  method  put.  The  object  parameter  buf  should 
be first split into its local data. The body of method put is mainly a β
functional form; therefore, rule (4) is applied. The applications of append-right
and inc are translated by rule (1).

The  resultant  IPR  is  in  Figure  5.17  and  the  corresponding  textual  form  is 
shown  as  follows  : 


Node   Function       Inputs   Outputs
1      CONSTRUCT      2,4      OUTPUT
2      append-right   3,5      1
3      x              INPUT    2
4      inc            5        1
5      SPLIT          5        2,4
6      buf            INPUT    5

Figure 5.17: IPR for method put.


5.5 Implementation of PROOF/L Front-End Translator

In this section, the implementation of the PROOF/L front-end translator is
presented. Figure 5.18 shows the architecture of the front-end translator.

Figure 5.18: Architecture of the PROOF/L Front-End Translator. (Inputs shown in the
figure: lexical rules as regular expressions, and syntax rules as a grammar.)

There  are  four  main  modules:  scanner,  parser,  code  generation  and  symbol  table 
handler.  The  scanner  and  parser  are  coded  with  the  aid  of  UNIX  language  tools  lex 
and yacc. We will present each of these four modules separately.

5.5.1 Lexical Analyzer

The function of a lexical analyzer is to group the input character stream into a token
stream, which serves as the input to the subsequent parsing phase. A token is the basic
unit of parsing.

In this part, we use the language tool lex to generate the code of the lexical analyzer.
We supply the lexical rules (as regular expressions) to lex, and it generates the
corresponding finite state machine for lexical analysis.

The input format of lex is divided into three parts:

<definitions>
%%
<rules>
%%
<programmer subroutines>


The first part <definitions> and the third part <programmer subroutines> are optional.

In the first part, <definitions>, we can name character sets that are used by the
lexical rules in the next part. For example,

letter            [a-zA-Z]
digit             [0-9]
letter_or_digit   [a-zA-Z_0-9]
sign              [+-]


In  the  second  part  <rules>,  we  can  use  these  defined  sets  to  express  the  lexical 
rules.  For  example,  the  lexical  rules  for  integer  numbers: 


{digit}+    {
                yylval.yint = atoi(yytext);
                return token(INTEGER);
            }

The left-hand side is the regular expression for an integer, and the right-hand
side is the corresponding action for an integer token: it converts the text into a
number and returns the token INTEGER.

The last part <programmer subroutines> consists of C routines written by the
user.

Another part of this module is the screener. The function of the screener is to
distinguish keywords from identifiers, because both have the same structure and cannot
be distinguished by regular expressions. Here we simply maintain a sorted keyword table
and use binary search to accomplish this.
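
The division of labor between the scanner and the screener can be sketched in Python as follows. This is an illustration only, not the lex-generated C code: the keyword list is an assumption built from reserved words that appear in the PROOF/L examples of this chapter, and the names KEYWORDS, screen and scan are hypothetical.

```python
import bisect
import re

# Assumed keyword table (sorted so that binary search can be used).
KEYWORDS = sorted(["class", "composition", "end", "expression", "guard",
                   "if", "method", "program", "while"])

# One pattern per token class; a word may be a keyword or an identifier.
TOKEN_RE = re.compile(
    r"\s*(?:(?P<INTEGER>[0-9]+)|(?P<WORD>[a-zA-Z][a-zA-Z_0-9]*))")


def screen(lexeme: str) -> str:
    """Screener: binary search in the sorted keyword table."""
    i = bisect.bisect_left(KEYWORDS, lexeme)
    if i < len(KEYWORDS) and KEYWORDS[i] == lexeme:
        return "KEYWORD"
    return "ID"


def scan(text: str):
    """Group the characters of text into (token, value) pairs."""
    tokens = []
    text = text.rstrip()
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        if match.lastgroup == "INTEGER":
            # Same action as the lex rule: convert the text into a number.
            tokens.append(("INTEGER", int(match.group("INTEGER"))))
        else:
            word = match.group("WORD")
            tokens.append((screen(word), word))
        pos = match.end()
    return tokens
```

The regular expression alone cannot tell method from put; only the table lookup in screen separates the two.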


5.5.2  Parser 

The function of a parser is to check the correctness of the input and then generate the
corresponding abstract syntax trees. Here we use another language tool, yacc, to
generate the code for parsing.

The input format of yacc is similar to that of lex. It also consists of the same three
parts (definitions, rules and programmer subroutines), and they are also separated by
two %% lines. Only the second part is compulsory; the other two are optional.

The  first  part  is  the  definitions.  We  need  to  give  the  definitions  of  tokens,  return 
types,  priority  between  operators  and  start  rule  of  the  grammar.  For  example, 


%token   INTEGER
%type    <yint>   INTEGER
...
%start   proofl

The  second  part  is  the  most  important  part,  including  grammar  rules  and  corre¬ 
sponding  actions.  For  example, 

proofl : PROGRAM ID COLON class_list obj_list body_list END
           {
               body();
           }
       ;


This is the main grammar rule of a PROOF/L program. It begins with the reserved
word program and the name of the program. After a colon (:), the rest of the program
consists of the class declarations and object declarations. Finally comes the list of
bodies of active objects, terminated by the reserved word end. As in lex, the text
between a pair of { and } is the action part of this rule. For the example above,
the action calls the function body() to build the object body list.

The grammar rules part has been presented in this section. The corresponding
actions are mostly code generation, which generates different IPR
according to the syntax recognized.

The rest of the actions are symbol table manipulations. They include the operations
on the method table, class table and object table. These will be discussed later.


5.5.3  Code  Generation 

The  code  generation  part  is  mainly  an  implementation  of  those  translation  rules.  It 
can  be  summarized  in  the  following  procedure: 

1. Recognize the syntax structure and select the corresponding translation rule.
This is actually part of the parser and is executed recursively.

2. According to the selected translation rule, generate the component IPR nodes.
Label every node with a unique sequence number. This number is used to build
the input/output relations with other nodes.

3. Build the input/output relations among nodes. Every IPR node has two arrays
attached, one for its input nodes and one for its output nodes. If an input or
output comes from outside, i.e., it is a parameter of a method or the node is a
constant node, a -1 is assigned as its input or output.

4. Analyze the connected nodes and add SPLIT and COPY node(s) if they are
necessary. A COPY node is added when several nodes share a common input.
For example, the parameters of a method and the bound variables of a λ functional form
are all common inputs. A SPLIT node is added when an object is a parameter
of its own method or when an object is used in its own body. Because an object is
implemented as a list, the list must be split first in order to use its local data.

5. Perform the data dependency analysis for control constructs. IPR is basically
an intermediate form based on data dependency. However, control constructs
such as while and if may or may not have data dependency between nodes.
Hence, we should first find out whether data dependency exists. This can be done
by searching for the bound variable in the then-part and else-part of the if functional
form or in the loop body of the while functional form. If the bound variable cannot
be found, this is a pure control dependency. As with the sequential functional
form, we can use a LATCH node to enforce control dependency. Therefore, for
each pure control dependency in these control constructs, LATCH nodes are
added to its inputs to enforce control flow.
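
The COPY part of step 4 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the translator's code: the graph is held in two dictionaries (node number to function name, and node number to list of input nodes), every node has an entry in both, and the name insert_copy_nodes is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def insert_copy_nodes(functions: Dict[int, str],
                      inputs: Dict[int, List[int]]
                      ) -> Tuple[Dict[int, str], Dict[int, List[int]]]:
    """Return a new graph in which every node that feeds more than one
    consumer is followed by a COPY node that fans its value out."""
    functions = dict(functions)
    inputs = {node: list(srcs) for node, srcs in inputs.items()}

    # Collect, for each producer, the nodes that consume its output.
    consumers = defaultdict(list)
    for node, srcs in inputs.items():
        for src in srcs:
            consumers[src].append(node)

    next_id = max(functions) + 1          # next unused sequence number
    for src in sorted(consumers):
        users = consumers[src]
        if len(users) > 1:                # common input of several nodes
            copy_id = next_id
            next_id += 1
            functions[copy_id] = "COPY"
            inputs[copy_id] = [src]
            for user in users:            # rewire consumers to the COPY node
                inputs[user] = [copy_id if s == src else s
                                for s in inputs[user]]
    return functions, inputs


# The situation of Figures 5.11 and 5.12: x is used by f1, f2 and f3.
funcs = {1: "x", 2: "f1", 3: "f2", 4: "f3"}
ins = {1: [], 2: [1], 3: [1], 4: [1]}
new_funcs, new_ins = insert_copy_nodes(funcs, ins)
```

After the call, node 5 is a COPY node fed by x, and f1, f2 and f3 all read from node 5 instead of reading x directly.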


5.5.4  Symbol  Table  Handling 

Because PROOF/L is an object-oriented applicative language, the only symbols we
need to deal with are objects. To maintain the symbol table of objects, a class
table is maintained first, and every entry in the class table has two pointers, one to its
method list and one to its local data list. When an object is declared, an entry is
added to the object table, and this entry has a pointer to its class in the class table. We
can use this to check whether every declared object belongs to a declared class, and whether
local data accesses and method invocations are legal. For active or pseudo-active
objects, the entries in the object table also have pointers to the body list.

The  relations  among  the  tables  are  shown  in  Figure  5.19. 


Figure  5.19:  The  structure  of  the  symbol  table. 
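
A minimal Python sketch of these tables is given below. It only illustrates the relations described above; the class and method names (ClassEntry, ObjectEntry, SymbolTable, declare_object and so on) are hypothetical, and the actual implementation is written in C inside the yacc actions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ClassEntry:
    name: str
    methods: List[str] = field(default_factory=list)      # method list
    local_data: List[str] = field(default_factory=list)   # local data list


@dataclass
class ObjectEntry:
    name: str
    cls: ClassEntry                   # pointer to the entry in the class table
    body: Optional[list] = None       # only for active or pseudo-active objects


class SymbolTable:
    def __init__(self) -> None:
        self.classes: Dict[str, ClassEntry] = {}
        self.objects: Dict[str, ObjectEntry] = {}

    def declare_class(self, name: str, methods: List[str],
                      local_data: List[str]) -> None:
        self.classes[name] = ClassEntry(name, list(methods), list(local_data))

    def declare_object(self, name: str, class_name: str,
                       body: Optional[list] = None) -> None:
        # Every declared object must belong to a declared class.
        if class_name not in self.classes:
            raise KeyError(f"undeclared class: {class_name}")
        self.objects[name] = ObjectEntry(name, self.classes[class_name], body)

    def method_is_legal(self, obj: str, method: str) -> bool:
        return obj in self.objects and method in self.objects[obj].cls.methods

    def data_is_legal(self, obj: str, datum: str) -> bool:
        return obj in self.objects and datum in self.objects[obj].cls.local_data


table = SymbolTable()
table.declare_class("Bounded_Buffer", ["put", "get"], ["store", "count"])
table.declare_object("buf", "Bounded_Buffer")
```

With these entries, table.method_is_legal("buf", "get") holds, while a call to an undeclared method or a declaration against an undeclared class is rejected.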

5.6  Limitations 

The current implementation has the following limitations:

• Built-in data types: Currently only integer is supported. In the future, for better
programming support, more built-in data types should be supported, such as
floating point, character and string.

• Abstract data types: In the current version, only list is supported. To serve a wider
range of applications, more abstract data types should be supported. For example,
to support scientific computing such as matrix and vector processing,
arrays should be supported.

• Type checking: Because currently only integer is supported, only a very simple type
check is implemented in the parser.

• I/O facilities: I/O facilities are not yet supported in the language. Because
PROOF/L is an object-oriented language, I/O facilities can be added by
implementing some I/O objects, like cin and cout for standard I/O in C++.

• Inheritance: Although the inheritance syntax is supported in the parser, the corresponding
part of the symbol table has not yet been fully implemented. However, this can be
accomplished by adding some hierarchical structures and a duplication mechanism.

Chapter 6


Grain Size Analysis


The goal of grain size analysis is to reduce the completion time of the program by
minimizing the communication time without sacrificing parallelism. When we discussed
partitioning in Chapter 4, we focused on exploring parallelism among the objects.
When we consider grain size determination, we will find proper grain sizes within
each object. In other words, we consider each object as an independent program
throughout this chapter¹.

In this chapter, we will present three grain size determination algorithms for three
patterns of parallelism: tree parallelism, graph parallelism and pipelined parallelism.
For tree parallelism, we will present an efficient heuristic grain size determination
algorithm and show that this algorithm can find optimal grain sizes in certain cases.
We will then generalize this algorithm for the case of graph parallelism. In both cases,
we compare the results of our approach with the existing grain size determination
approaches. We will also present a grain size determination algorithm for the case of
pipelined parallelism. Then, we will describe a method to modify the IPR form to incorporate
the information we have obtained from grain size analysis and partitioning.

In most of the approaches to partitioning and allocation of the tasks, or the
schedulable units in an object, information regarding the execution times of the tasks and
the communication times between the tasks is assumed to be available as the
input, and thus the problem of obtaining such information has not been addressed
[37, 38, 44]. However, in order to perform the grain size analysis based on the tradeoff
between parallel execution and communication overhead, we need to estimate the
execution time of each node in IPR and the communication time between two adjacent
nodes. In our approach, we obtain the execution time by estimating it for the simple
nodes defined in IPR, and the communication time by examining the type information
of the data being transmitted. The estimation can be done statically by analyzing the assembly
language code generated for these simple nodes.

¹In fact, our grain size determination approaches can also be used for object-level parallelism, if we have the
information regarding the execution time of the tasks and the communication time between the tasks.

6.1 Grain Size Determination


In this section, we will discuss the existing grain size determination strategies and the
assumptions we will make for our grain size analysis.

The  existing  grain  size  determination  strategies  can  be  divided  into  two  categories 
based  on  how  the  grain  size  is  determined:  programmer  control  and  automatic  deter¬ 
mination.  In  programmer-controlled  approaches,  the  programmer  is  fully  responsible 
for  determining  the  grain  sizes  as  well  as  explicitly  expressing  the  parallelism.  The 
programmer  can  use  parallel  language  constructs  indicating  how  the  tasks  are  exe¬ 
cuted  in  parallel.  When  a  programmer  has  specific  information  about  the  behavior 
of  a  program,  he  can  determine  the  sizes  of  tasks.  When  that  program  is  ported  to 
a  different  parallel  processing  system,  the  sizes  of  tasks  need  to  be  changed  to  better 
fit  the  new  processors.  In  addition,  it  may  not  be  easy  for  the  programmer  to  make 
decisions  on  the  sizes  of  the  tasks  due  to  lack  of  information. 

On  the  other  hand,  in  the  category  of  automatic-determination  approaches  [48, 
52,  53,  54],  grain  sizes  are  determined  automatically.  This  category  of  approaches 
can  be  further  divided  into  two  classes:  compiler  approach  and  run-time  approach. 
In  a  compiler  approach,  the  programmer  does  not  provide  any  information  regarding 
the  granularity.  During  compilation,  heuristics  are  used  to  statically  determine  the 
sizes of tasks [48, 53, 54]. One disadvantage of this kind of approach is that some
information may not be available before run time.

In  a  run-time  approach,  only  simple  heuristics  can  be  applied  to  determine  the  sizes 
due  to  costly  overhead.  For  example,  as  in  [52],  each  recursive  function  call  creates 
a  new  task  to  be  assigned  to  a  processor.  In  this  case,  since  functional  programming 
involves  frequent  recursive  function  calls,  it  is  likely  that  too  many  small  tasks  will 
saturate  the  system.  In  general,  such  a  run-time  approach  ignores  the  size  of  tasks 
under  the  assumption  that  there  are  reasonably  many  processors  available.  The  auto¬ 
matic  determination  approach  looks  more  promising  since  the  programmer  does  not 
need  to  worry  about  the  grain  size  at  all. 

In our approach, the grain size is determined based on the analysis of the execution
and communication times, which can be obtained at compilation time.

We  make  the  following  assumptions  about  the  underlying  parallel  processing  sys¬ 
tems: 

• The system is an MIMD machine consisting of fully connected identical processors
having the same processing capability.

• Each processor is capable of performing program execution and I/O simultaneously.

• The communication cost between two processors depends only on the size of the data
to be transmitted. Currently, we ignore the time required to set up the communication.

• The communication cost between two tasks residing on the same processor is negligible
and hence is not considered.

Figure 6.1: An IPR representation for $1.

The  following  notations  will  be  used  in  this  section: 

Definition 6.1.1 The execution time for a node p, denoted by e(p), is the amount
of time required to complete the task represented by p without being interrupted.

Definition 6.1.2 The communication time between two nodes p and q, denoted by
c(p,q), is the amount of time required to transmit data from p to q under the
assumption that p and q are assigned to adjacent processors.


Definition 6.1.3 The completion time for a program $ with k processors, denoted
by D($,k), is the amount of time required to finish all the tasks and communication
in the program $ with the k processors.

Suppose that we have a PROOF/L program, called $1, such as a(b(d(v1),
e(v2)), c(f(v3), g(v4))). The IPR for $1 is shown in Figure 6.1. The graph
consists of seven tasks, a, b, c, d, e, f, g, and six edges representing data
precedence relations among the tasks. This tree-type parallelism is a typical form resulting
from a divide-and-conquer algorithm.

Assume that there are enough processors, one for each task. A simple approach
to assigning the tasks to the processors would be to assign the four tasks d, e, f and
g to four different processors P1, P2, P3 and P4, respectively. After completing the
execution of these four tasks, P1 and P3 continue with the execution of the tasks b and c,
respectively. Then, P1 executes a to complete the execution. In this case, D($1,4)
can be calculated as follows:

D($1,4) = 5 + 10 + 2 + 10 + 1 = 28

This result is not desirable since, if we only utilize one processor,

D($1,1) = Σ_{x=a}^{g} e(x)
        = 5 × 4 + 2 × 2 + 1
        = 25.

Thus, D($1,1) < D($1,4) and there is no gain in parallel processing because the
communication overhead has overshadowed the gain of parallel execution.
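
The two completion times can be recomputed with a short Python sketch. Figure 6.1 is not reproduced here, so the weights below are assumptions inferred from the arithmetic above: execution time 5 for d, e, f and g, 2 for b and c, 1 for a, and communication time 10 on every edge. The names EXEC, COMM, CHILDREN and completion_time are hypothetical.

```python
# Assumed weights, inferred from the sums 5 + 10 + 2 + 10 + 1 and 5*4 + 2*2 + 1.
EXEC = {"a": 1, "b": 2, "c": 2, "d": 5, "e": 5, "f": 5, "g": 5}
COMM = 10
CHILDREN = {"a": ["b", "c"], "b": ["d", "e"], "c": ["f", "g"]}


def completion_time(node: str, processor: dict) -> int:
    """Finish time of node when each task starts as soon as its inputs arrive.

    Data from a child on another processor arrives COMM time units after the
    child finishes; data from a child on the same processor arrives at once.
    Contention between independent tasks on one processor is not modeled,
    which is sufficient for the schedule discussed in the text.
    """
    ready = 0
    for child in CHILDREN.get(node, []):
        arrival = completion_time(child, processor)
        if processor[child] != processor[node]:
            arrival += COMM
        ready = max(ready, arrival)
    return ready + EXEC[node]


# Schedule from the text: d, e, f, g on P1..P4; b and a on P1; c on P3.
four_processors = {"d": 1, "e": 2, "f": 3, "g": 4, "b": 1, "c": 3, "a": 1}
d_four = completion_time("a", four_processors)

# On a single processor all tasks run back to back with no communication.
d_one = sum(EXEC.values())
```

Under these assumed weights the sketch gives 28 for four processors and 25 for one processor, in agreement with the calculation above.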


6.2  Tree  Parallelism 

We  will  first  define  the  terms  to  be  used  for  tree  parallelism. 

Definition 6.2.1 A one-level subtree is a subset of nodes v_0, v_1, ..., v_n in a tree such
that v_0 is the parent node of all the nodes v_1, v_2, ..., v_n.

In  the  following,  we  call  a  one-level  subtree  simply  a  subtree.  The  number  of 
subtrees  in  a  tree  is  the  same  as  the  number  of  non-leaf  nodes. 

Definition  6.2.2  A  task  precedence  tree  Tp  is  a  tree  in  which  each  node  represents  a 
task  and  each  edge  specifies  the  data  dependency  relation  between  two  nodes. 

Parallelism obtained from the divide-and-conquer strategy can lead to parallelism
with a tree pattern and can thus be represented by a task precedence tree Tp.

Definition  6.2.3  A  gain  tree  Tg  of  Tp  is  a  weighted  tree  in  which  each  node,  called 
a  gain  node,  represents  a  subtree  in  Tp  and  each  edge  represents  data  dependency 
relations  among  the  nodes.  Each  gain  node  has  a  weight,  called  gain,  corresponding 
to  the  amount  of  maximum  contribution  to  reducing  the  completion  time  when  the 
nodes  in  the  corresponding  subtree  are  grouped  into  a  node. 

A gain for a subtree consisting of n_1, ..., n_m is denoted by GAIN(n_1, ..., n_m). Our
grain size determination approach can be considered as a horizontal grouping or
partitioning process in that a set of adjacent nodes, i.e., a subtree, is considered as a
candidate for grouping. The essential part of our grain size determination approach
is to estimate the possible contributions which can be made by grouping the adjacent
nodes. Our approach consists of two parts: building a gain tree from a given input task
precedence tree, and determining grain sizes from the gain tree. The gain tree can
be built by analyzing each subtree in the task precedence tree using the following
procedure:

The input to the procedure is a subtree consisting of a node s and its child nodes
n_1, ..., n_m. The output is GAIN(s, n_1, ..., n_m).

Procedure Gain-Analysis

Step 1  Calculate the total execution time for all nodes in one processor.
        Let τ be the total execution time. Then,
        τ = e(s) + Σ_{i=1}^{m} e(n_i).

Step 2  Let σ_i = e(n_i) + c(n_i, s), i = 1, 2, ..., m.
        Find a node n_k, 1 ≤ k ≤ m, such that σ_k is the second largest
        among σ_i, i = 1, 2, ..., m.

Step 3  If σ_k + e(s) > τ, then
            GAIN(s, n_1, n_2, ..., n_m) = σ_k + e(s) - τ
        else
            GAIN(s, n_1, n_2, ..., n_m) = 0.

In Step 1, the total execution time τ is determined by summing the execution
times of all the nodes in the subtree. Step 2 calculates the time required to complete
the computation represented by the subtree when the nodes in that subtree are not
grouped together. Since one of the child nodes and the node s can be scheduled on
the same processor, and the scheduler can choose a node n_j, 1 ≤ j ≤ m, such that
σ_j is the largest, the second largest schedule length σ_k is calculated and used as the
actual time required to complete the task represented by s, n_1, ..., n_m. In Step 3, a
gain is calculated by comparing the total execution time τ with the actual completion
time σ_k calculated in Step 2. If there is a positive gain, then the amount of the gain
calculated is assigned to the gain node. Otherwise, the gain is set to zero.

Step 1 requires O(1) time, Step 2 requires O(m) time, where m is the number of
child nodes, and Step 3 requires O(1) time. Thus, the time complexity of the procedure
Gain-Analysis is O(m).
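
The procedure can be written as a short Python sketch. The function name gain_analysis and the representation of the children as (e(n_i), c(n_i, s)) pairs are assumptions of this sketch; the procedure does not say what to do for a subtree with a single child, so the sketch returns a gain of 0 in that case, since the child can share a processor with s.

```python
import heapq
from typing import List, Tuple


def gain_analysis(e_s: float, children: List[Tuple[float, float]]) -> float:
    """Gain of grouping node s with its children.

    e_s      -- execution time e(s) of the parent node s
    children -- one (e(n_i), c(n_i, s)) pair per child node
    """
    # Step 1: total execution time on one processor.
    tau = e_s + sum(e for e, _ in children)

    # Step 2: sigma_i = e(n_i) + c(n_i, s); take the second largest value.
    sigmas = [e + c for e, c in children]
    if len(sigmas) < 2:
        return 0          # assumption: a single child yields no gain
    sigma_k = heapq.nlargest(2, sigmas)[1]

    # Step 3: positive difference between ungrouped and grouped time.
    ungrouped = sigma_k + e_s
    return ungrouped - tau if ungrouped > tau else 0


# Subtree (b; d, e) of the example in Section 6.1, with the weights assumed
# there: e(b) = 2, e(d) = e(e) = 5, communication time 10 on each edge.
gain_b = gain_analysis(2, [(5, 10), (5, 10)])
```

For this subtree τ = 12 and σ_k + e(s) = 17, so grouping b, d and e saves 5 time units.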

Now, we build a gain tree Tg from a task precedence tree Tp using the procedure
Gain-Analysis in the following manner:

The input to this procedure is a task precedence tree and its output is a gain tree.

Algorithm 6.2.1 Build Gain Tree

For all subtrees t_i, do
    Let t_i consist of a node s and its child nodes n_1, n_2, ..., n_m.
    Step 1  Call Gain-Analysis(s, n_1, n_2, ..., n_m).
    Step 2  Connect the gain node to the existing gain nodes.

In Step 1, Gain-Analysis is called for each subtree to determine the possible
contribution to the reduction of the completion time when all the nodes in the subtree are
grouped. In Step 2, the newly created gain node for the current subtree is connected
to the existing gain nodes. The construction of the gain tree can be done by omitting
the leaf nodes from the original task precedence tree and associating the gain calculated
in Step 1 with the parent of the corresponding subtree. Step 1 requires O(m) time, where
m is the number of child nodes. Since Step 1 is executed for each subtree and
the number of subtrees is bounded by the number of non-leaf nodes in the tree, the
algorithm needs to visit each node once. In Step 2, each node also needs to be visited
once. Thus, the time complexity of Algorithm 6.2.1 is O(n), where n is the number
of nodes in the tree.
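
A possible Python rendering of Algorithm 6.2.1 is sketched below. The dictionary-based tree representation and the names gain and build_gain_tree are assumptions of this sketch; the weights in the example are the ones assumed for the program of Section 6.1 (execution times 5, 2 and 1, communication time 10 on every edge).

```python
import heapq
from typing import Dict, List, Tuple


def gain(e_s: float, pairs: List[Tuple[float, float]]) -> float:
    """Procedure Gain-Analysis for one subtree (see the steps above)."""
    tau = e_s + sum(e for e, _ in pairs)
    sigmas = [e + c for e, c in pairs]
    if len(sigmas) < 2:
        return 0          # assumption: a single child yields no gain
    ungrouped = heapq.nlargest(2, sigmas)[1] + e_s
    return ungrouped - tau if ungrouped > tau else 0


def build_gain_tree(children: Dict[str, List[str]],
                    exec_time: Dict[str, float],
                    comm: Dict[Tuple[str, str], float]):
    """Map every non-leaf node s to (gain, child gain nodes).

    Leaf nodes of the task precedence tree are omitted, and the gain of each
    subtree is attached to the parent node of that subtree.
    """
    gain_tree = {}
    for s, kids in children.items():
        if not kids:
            continue                                  # leaf: no subtree
        pairs = [(exec_time[n], comm[(n, s)]) for n in kids]
        child_gain_nodes = [n for n in kids if children.get(n)]
        gain_tree[s] = (gain(exec_time[s], pairs), child_gain_nodes)
    return gain_tree


children = {"a": ["b", "c"], "b": ["d", "e"], "c": ["f", "g"]}
exec_time = {"a": 1, "b": 2, "c": 2, "d": 5, "e": 5, "f": 5, "g": 5}
comm = {(n, s): 10 for s, kids in children.items() for n in kids}
gain_tree = build_gain_tree(children, exec_time, comm)
```

The resulting gain tree has three gain nodes: a with gain 8 as the root, and b and c with gain 5 each as its children.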

Once the gain tree is built, the grain size can be determined by selecting groups
heuristically. Our grain size determination is based on the observation that the
contribution from the nodes close to the root node² propagates to the other nodes. In
order to illustrate this, suppose that we have simple gain trees as shown in Figure 6.2,
in which v_i, i = 1, 2, 3, represent a set of gain nodes and x, y, z represent the amount
of the gain for each node, respectively. In Figure 6.2 (a), the precedence relations are
v_2 → v_1 and v_3 → v_1. In Figure 6.2 (b), the precedence relations are v_1 → v_2 and
v_1 → v_3. The goal of the gain analysis is to select the two gain nodes for grouping
in order to increase the overall contribution to the reduction of the completion time.
Note that each gain node in the gain tree represents the contribution obtained when a set of
nodes in the subtree of the task precedence tree is grouped together. Thus, grouping
a gain node means grouping together a set of nodes in the task precedence
tree. The overall contribution is determined as follows:

overall contribution
= x + max(y, z), if min(y, z) < x
= min(y, z), if min(y, z) > x

Thus, in case of min(y, z) < x, if y > z, v_1 and v_2 are grouped together; otherwise
v_1 and v_3 are grouped together. In case of min(y, z) > x, v_2 and v_3 are grouped
together.

Note that the rule for determining the overall contribution presented above can be
applied to both gain trees shown in Figure 6.2. It implies that our grain size analysis
technique can be used for the analysis of both the in-tree form as in Figure 6.2 (a) and
the out-tree form as in Figure 6.2 (b). The following is an algorithm to determine grain
sizes for tree parallelism:

The input to this algorithm is a gain tree Tg consisting of a set of nodes n_1, n_2, ..., n_m,
and its output is a set of grains.

Figure 6.2: Two simple gain tree examples: (a) in-tree form, and (b) out-tree form

²The root node in a tree is a node having depth of 0.

Algorithm 6.2.2 Determine Grain Size

Step 1 Initialize all the gain nodes $n_i$, $i = 1, 2, \ldots, m$, as ‘ungrouped’.

Step 2 Sort $n_i$, $i = 1, 2, \ldots, m$, in $T_g$ using the gain as a primary key in
       descending order and the depth of each gain node as a secondary key in
       ascending order.

Step 3 Get the gain node $n_l$ whose gain is the largest in the sorted list.

    Step 3.1 If $n_l$ is not a root node, and both the parent of $n_l$ and
             all the siblings of $n_l$ are set ‘grouped’, then
                 go to Step 3.4.

    Step 3.2 If two or more child nodes of $n_l$ are set ‘grouped’, then
                 go to Step 3.4.

    Step 3.3 Set $n_l$ as ‘grouped’.

    Step 3.4 Delete $n_l$ from the list.

    Step 3.5 If there is a gain node with a positive gain, then
                 go to Step 3,
             else
                 stop.
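
The steps of Algorithm 6.2.2 can be sketched in Python as follows. The class `GainNode` and the function `determine_grouping` are hypothetical names introduced for illustration. The sketch makes three assumptions: the positive-gain test of Step 3.5 is applied to the head of the sorted list, including the very first node; a node without siblings counts as having all of its siblings grouped; and the result is reported as the names of the nodes marked ‘grouped’.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class GainNode:
    name: str
    gain: float
    depth: int
    parent: Optional["GainNode"] = None
    children: List["GainNode"] = field(default_factory=list)
    grouped: bool = False

    def add_child(self, child):
        child.parent = self
        self.children.append(child)
        return child


def determine_grouping(nodes):
    """Mark gain nodes as grouped, following Steps 1 to 3.5."""
    for n in nodes:                                   # Step 1
        n.grouped = False
    # Step 2: gain descending, depth ascending.
    pending = sorted(nodes, key=lambda n: (-n.gain, n.depth))
    for n in pending:                                 # Step 3 (iteration plays the role of Step 3.4)
        if n.gain <= 0:                               # Step 3.5: no positive gain left
            break
        if n.parent is not None:                      # Step 3.1
            siblings = [c for c in n.parent.children if c is not n]
            if n.parent.grouped and all(s.grouped for s in siblings):
                continue
        if sum(c.grouped for c in n.children) >= 2:   # Step 3.2
            continue
        n.grouped = True                              # Step 3.3
    return [n.name for n in nodes if n.grouped]
```

Sorting dominates the running time of this sketch, which agrees with the bound derived for the algorithm in the text.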

Algorithm 6.2.2 determines which subtrees need to be grouped by analyzing the gains
of the possible candidates. Step 1 requires $O(m)$ time to visit each gain node once,
where $m$ is the number of gain nodes. In Step 2, $O(m \log m)$ time is required to
sort the $m$ nodes. Step 3 must visit all the adjacent nodes of $n_l$ at most once for each
gain node. Because the number of gain nodes adjacent to $n_l$ is the same as the number
of edges incident to $n_l$, and each edge is visited at most twice, the overall time
complexity of Step 3 is bounded by $O(e)$, where $e$ is the number of edges in $T_g$.
Therefore the time complexity of this algorithm is $O(\max(m \log m, e))$. Note that in the
case of trees, the time complexity is $O(m \log m)$, since a tree with $m$ nodes has only
$m - 1$ edges and $m \log m$ therefore dominates $e$.

In the remainder of this section, we compare our approach to McCreary’s approach [53]
to demonstrate the efficiency of our approach. McCreary uses an algorithm of time
complexity $O(n^3)$, where $n$ is the number of nodes in the task precedence
tree, that decomposes a graph into a set of clans that are classified as primitive,
linear, or independent. When the clans are labeled as independent, the possibility of
parallelization exists. However, when the clans are labeled as primitive or linear, they
are grouped together and executed sequentially. In order to compare our approach
to McCreary’s, we use the same example used in [53], whose task precedence tree is
shown in Figure 6.3. Every node has a unique number within the circle representing
that node, and a weight is attached to the node. A communication cost is assigned
to each edge. For instance, node 9 has a weight of 1 and each of its three edges has a
communication cost of 18. Using McCreary’s approach, the schedule and its completion
time are shown in Figure 6.4.

The problem with McCreary’s approach is that it begins to search for the candidates
for grouping from the bottom of the tree. Once the grouping is done at the
lower  part  of  the  tree,  the  grouping  at  the  higher  level  is  less  likely  to  occur  since 
such  grouping  may  have  to  sacrifice  the  possibility  of  parallel  execution  at  the  lower 
level.  In  addition,  the  key  concept  of  the  graph  decomposition  approach,  clan,  is 
determined  without  using  information  regarding  the  execution  time  of  the  nodes  and 
communication  time  between  the  nodes. 

Our  grain  size  determination  approach  uses  such  information  to  determine  the 
proper  grains  at  the  beginning.  By  analyzing  the  gain  locally  in  each  subtree,  we 
build  the  gain  tree.  Then,  we  first  select  the  largest  gain  node  in  the  gain  tree  as 
the  candidate  for  the  grouping  and  continue  to  select  the  next  largest  gain  node  until 
all the gain nodes with positive gain are processed. The gain tree for this example
is shown in Figure 6.5, where the calculated gain is shown within each gain node and
the nodes of the task precedence tree in Figure 6.3 that form a gain node are
identified beside the gain node.

From  the  information  obtained  in  the  gain  tree,  we  can  determine  five  grains  as 
follows: 


C1 = { 1, 2, 9, 10, 13, 14, 15 }
C2 = { 5, 6, 11 }
C3 = { 7, 8, 12 }
C4 = { 3 }
C5 = { 4 }

Using these grains, we obtain the two schedules shown with their completion times in
Figure 6.6, one for four processors and the other for five processors.


Figure  6.5:  A  gain  tree  for  the  tree  parallelism  example. 


In  the  following,  we  will  present  a  task  allocation  strategy  which  can  produce  an 
optimal  solution  when  the  following  conditions  are  met: 


•  The  grain  size  analysis  indicates  that  no  more  grouping  of  grains  is  needed  to 
reduce  communication  time  among  the  tasks. 

•  The  task  precedence  graph  is  a  tree. 

•  There  are  a  sufficient  number  of  processors  in  the  system  so  that  the  leaf  nodes 
in  the  task  precedence  graph  can  be  executed  simultaneously,  if  necessary. 

This  algorithm  can  be  considered  as  a  variant  of  list  scheduling,  and  uses  the  concept 
of  the  gain  analysis.  The  input  to  this  algorithm  is  a  task  precedence  tree,  and  its 
output  is  a  set  of  grains. 


Algorithm 6.2.3 Task allocation for tree

Step 1 Find a node $s$ whose child nodes, $n_i$, $i = 1, 2, \ldots, m$, are all leaf nodes.
       If there is no such node $s$, then stop.

Step 2 If $s$ has only one child node $n_1$, then
           Group $s$ and $n_1$ into a new node $y$ so that $e(y) = e(s) + e(n_1)$.
           Delete $s$ and $n_1$ from the tree.
           Attach $y$ to the place of $s$.
           Go to Step 1.

Step 3 For $s$ and $n_i$, $1 \le i \le m$, do

    Step 3.1 Calculate the total execution time in one processor.
             Let $\tau$ be the total execution time. Then,
             $\tau = e(s) + \sum_{i=1}^{m} e(n_i)$.

    Step 3.2 Let $\sigma_i = e(n_i) + c(n_i, s)$, $i = 1, 2, \ldots, m$.
             Find a node $n_j$ so that $\sigma_j$ is the largest
             among $\sigma_i$, $i = 1, 2, \ldots, m$.

    Step 3.3 Find a node $n_k$ so that $\sigma_k$ is the largest
             among $\sigma_i$, $i = 1, 2, \ldots, j - 1, j + 1, \ldots, m$.

    Step 3.4 If $\sigma_k > \tau - e(s)$, then
                 Group $s$ and $n_i$, $i = 1, 2, \ldots, m$, into a new node $y$
                 so that $e(y) = \tau$.
                 Delete $s$ and $n_i$, $i = 1, 2, \ldots, m$, from the tree.
                 Attach $y$ to the place of $s$.
             else
                 Create a group for each $n_i$,
                 $i = 1, 2, \ldots, j - 1, j + 1, \ldots, m$.
                 Group $s$ and $n_j$ into a new node $r$
                 so that $e(r) = \max(e(n_j) + e(s), \sigma_k + e(s))$.
                 Delete $s$ and $n_i$, $i = 1, 2, \ldots, m$, from the tree.
                 Attach $r$ to the place of $s$.

    Step 3.5 Go to Step 1.

Figure 6.6: A schedule obtained from our approach: (a) for 4 processors, and (b) for 5 processors.
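
A compact Python sketch of Algorithm 6.2.3 is given below. It is not the authors' implementation. The function name `allocate_tree` and the dictionary-based tree encoding are assumptions made for illustration, the single-argument cost is read as the execution time of a node and the two-argument cost as the communication cost of an edge, and the repeated search of Step 1 is replaced by a bottom-up recursion, which visits the nodes in an order that satisfies Step 1.

```python
def allocate_tree(children, e, c, root):
    """children: node -> list of child nodes; e: node -> execution time;
    c: (child, parent) -> communication cost.
    Returns (execution time of the collapsed root, list of groups)."""
    time = dict(e)                       # working copy, updated as nodes collapse
    members = {n: [n] for n in e}        # original nodes merged into each node
    split_off = []                       # groups created in the else part of Step 3.4

    def collapse(s):
        kids = children.get(s, [])
        for k in kids:
            collapse(k)                  # afterwards every child of s acts as a leaf
        if not kids:
            return
        if len(kids) == 1:               # Step 2
            (k,) = kids
            time[s] += time[k]
            members[s] += members[k]
            return
        tau = time[s] + sum(time[k] for k in kids)          # Step 3.1
        sigma = {k: time[k] + c[(k, s)] for k in kids}      # Step 3.2
        ranked = sorted(kids, key=sigma.get, reverse=True)
        j, k2 = ranked[0], ranked[1]                        # Steps 3.2 and 3.3
        if sigma[k2] > tau - time[s]:                       # Step 3.4, then part
            for k in kids:
                members[s] += members[k]
            time[s] = tau
        else:                                               # Step 3.4, else part
            for k in kids:
                if k != j:
                    split_off.append(members[k])
            members[s] += members[j]
            time[s] = max(time[j] + time[s], sigma[k2] + time[s])

    collapse(root)
    return time[root], split_off + [members[root]]
```

With a parent of execution time 2 and two children of execution times 3 and 4 whose edges cost 1 and 10, the child with the expensive edge is merged into the parent and the other child keeps its own group.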


The following theorem indicates that Algorithm 6.2.3 produces an optimal allocation
under certain conditions.

Theorem  6.2.1  Algorithm  6.2.3  produces  an  optimal  allocation  when  the  following 
conditions  are  satisfied:  1)  No  more  grouping  is  required  in  the  sense  that  the  grains 
of  proper  size  have  been  obtained.  2)  The  task  precedence  graph  is  a  tree.  3)  There 
are  more  processors  than  leaf  nodes  in  the  task  precedence  graph. 


Proof. If Condition 1) is satisfied, $\sigma_k$ as defined in Step 3.2 is always less than
or equal to $\tau - e(s)$ in Step 3.4 of Algorithm 6.2.3. Thus, the else part of Step 3.4
will  always  be  executed.  Conditions  2)  and  3)  imply  that  the  completion  time  of  the 
node  s  (the  time  required  to  complete  the  node  s  and  its  child  nodes)  only  depends 
on  how  the  node  s  and  its  child  nodes  are  allocated.  Thus,  we  only  need  to  show 
that  Algorithm  6.2.3  generates  an  optimal  solution  for  a  subgraph  of  the  given  task 
precedence  graph. 

To do this, let the allocation produced by Algorithm 6.2.3 for $s$ and $n_i$,
$i = 1, 2, \ldots, m$, be A, and its completion time be $C_A$. Then, we assume that there
exists another algorithm which generates a better allocation for $s$ and $n_i$,
$i = 1, 2, \ldots, m$; let this allocation be B and its completion time be $C_B$. In the
following, we will show that there exists no allocation B such that $C_B < C_A$. In A,
each $n_i$, $i = 1, 2, \ldots, j - 1, j + 1, \ldots, m$, will be
allocated to different processors, and in each cycle of Algorithm 6.2.3 only one node
is chosen to group with its parent, and the addition of any other nodes to this group
will not reduce the completion time by Condition 1). Suppose in B that a node $n_{j'}$,
$j' \ne j$, is selected in Step 3.2 and is grouped with its parent $s$. In Step 3.3, $n_j$ will be
chosen instead of $n_k$ because $\sigma_j$ is the largest among $\sigma_i$, $i = 1, 2, \ldots, m$.


Then, $C_B = \max[\sigma_j + e(s),\ e(n_{j'}) + e(s)]$. Since we know that $\sigma_j > e(n_{j'})$,

$$C_B = \sigma_j + e(s). \tag{1}$$

From A,

$$C_A = \max[e(n_j) + e(s),\ \sigma_k + e(s)]. \tag{2}$$

Therefore, two cases need to be considered:

In Case 1), if $e(n_j) + e(s) > \sigma_k + e(s)$, then from (2)

$$C_A = e(n_j) + e(s). \tag{3}$$

Comparing $C_B$ in (1) and $C_A$ in (3), we have $C_B - C_A = \sigma_j - e(n_j) = c(n_j, s) \ge 0$.
Hence, $C_B \ge C_A$.

In Case 2), if $\sigma_k + e(s) \ge e(n_j) + e(s)$, then from (2)

$$C_A = \sigma_k + e(s). \tag{4}$$

Comparing $C_B$ in (1) and $C_A$ in (4), we have
$C_B - C_A = \sigma_j + e(s) - \sigma_k - e(s) = \sigma_j - \sigma_k \ge 0$ since
$\sigma_j \ge \sigma_k$. Hence, $C_B \ge C_A$. This completes the proof of the theorem.

We  show  the  complexity  of  Algorithm  6.2.3  in  the  following  theorem. 

Theorem 6.2.2 The time complexity of Algorithm 6.2.3 is $O(n)$, where $n$ is the number
of nodes of the task precedence tree.

Proof. Steps 1, 2 and 3.1 each require $O(1)$ time to complete. Steps 3.2 and
3.3 require visiting $m$ nodes, where $m$ is the number of child nodes. Thus, $O(m)$ time
is required to complete these steps. Step 3.4 also requires visiting $m$ nodes, and thus
$O(m)$ time is needed to complete the step. Once they are visited, the nodes will not be
visited again since they are grouped at this step. Thus, the entire algorithm requires
visiting each node at most once. Therefore, the time complexity of this algorithm is
$O(n)$. This completes the proof of the theorem.


6.3  Graph  Parallelism 


Now  we  extend  our  grain  size  determination  approach  to  the  general  cases  in  which 
the  task  precedence  relation  is  represented  by  a  directed  graph.  We  first  define  the 
depth  and  height  in  the  directed  graph. 

Definition  6.3.1  The  depth  of  a  node  in  a  graph  is  the  length  of  the  longest  path 
from  the  highest  ancestor  of  that  node. 

A  node  having  no  incoming  edge  is  of  depth  0. 

Definition  6.3.2  The  height  of  a  graph  is  the  largest  depth  of  the  graph. 
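
The two definitions can be illustrated with a short Python sketch. The function names and the adjacency-dictionary encoding are assumptions made for this example. Because the depth is defined through the longest path, a node's depth is final only after all of its predecessors have been processed, so the sketch uses a queue-based topological traversal (Kahn's algorithm) instead of a plain shortest-path breadth-first search.

```python
from collections import deque


def node_depths(successors):
    """successors: node -> list of successor nodes of an acyclic directed graph.
    Returns node -> length of the longest path from any node without incoming edges."""
    nodes = set(successors)
    for succs in successors.values():
        nodes.update(succs)
    indegree = {n: 0 for n in nodes}
    for succs in successors.values():
        for t in succs:
            indegree[t] += 1
    depth = {n: 0 for n in nodes}            # nodes with no incoming edge have depth 0
    queue = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while queue:
        n = queue.popleft()
        visited += 1
        for t in successors.get(n, []):
            depth[t] = max(depth[t], depth[n] + 1)
            indegree[t] -= 1
            if indegree[t] == 0:             # all predecessors of t are done
                queue.append(t)
    if visited != len(nodes):
        raise ValueError("graph contains a cycle")
    return depth


def graph_height(successors):
    """Height of the graph: the largest depth of any node."""
    return max(node_depths(successors).values(), default=0)
```

In the graph a→b, a→c, b→c, c→d, node c has depth 2 (through b), although it is also a direct successor of a.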

A gain graph $G_g$ and a task precedence graph $G_p$ are defined in the same manner as
the gain tree and the task precedence tree except that they are graphs. In the graph
cases, we also analyze gains to determine the proper grain sizes. We first build a gain
graph and apply Algorithm 6.2.2 to determine the grain sizes. The following is an
algorithm to build a gain graph. The input to this procedure is a task precedence
graph $G_p$ and its output is a gain graph.

Algorithm 6.3.1 Build Gain Graph

Step 1 Determine the depth of each node in $G_p$.

Step 2 Initialize all the nodes as ‘unmarked’.

Step 3 Let the height of $G_p$ be $h$.

       For a depth $d = 0$ to $h$, do

           For each node $s$ of depth $d$, do

               Step 3.1 If $d > 0$, then
                            Find the predecessor nodes $n_i$, $i = 1, 2, \ldots, m$, of $s$
                            having a depth $d - 1$.
                            If each $n_i$, $i = 1, 2, \ldots, m$, is not ‘marked’, then
                                Call Gain-Analysis($s, n_1, n_2, \ldots, n_m$).
                                Set $n_i$, $i = 1, 2, \ldots, m$, ‘marked’.
                                If the out-degree of $s$ is greater than 1, then
                                    set $s$ ‘marked’.
                                Connect the gain node with the existing
                                gain nodes.

               Step 3.2 Find the successor nodes $n'_i$, $i = 1, 2, \ldots, m'$, of $s$
                        having a depth $d + 1$.
                        Call Gain-Analysis($s, n'_1, n'_2, \ldots, n'_{m'}$).
                        Set $n'_i$, $i = 1, 2, \ldots, m'$, ‘marked’.
                        Connect the gain node with the existing gain nodes.


In Step 1, the depth of each node can be determined using a breadth-first search
method. In Step 2, all the nodes are initially set as ‘unmarked’, meaning that the
node is not involved in any grouping. In Step 3, the gain graph is built by visiting each
node. First, a node having the smallest depth is chosen, and its predecessor nodes
are checked to determine whether any of them is already included in a group.
If none of them is included in any group, then the gain is calculated. The predecessor
nodes are all set ‘marked’, and the node is set ‘marked’ unless it has only one successor
node. A gain node representing the nodes under consideration is created. Edges are
established from the existing gain nodes to the newly created gain node if there is a
node common to both the new gain node and the existing gain nodes. Similar steps
are applied to the successor nodes.


Theorem 6.3.1 The time complexity of Algorithm 6.3.1 is $O(e)$, where $e$ is the number of edges in the task precedence graph.

Proof. Steps 1 and 2 require visiting each node once, and $O(n)$ time is required
to complete the steps, where $n$ is the number of nodes in the task precedence
graph. The complexity of Step 3 is dominated by the procedure Gain-Analysis, and
thus each pass of Step 3 requires $O(m + m')$ time, where $m$ and $m'$ are the number
of predecessor nodes and the number of successor nodes, respectively. Step
3 needs to be executed for each node in the graph. Because $m + m'$ is the total
number of edges incident to a node, each edge needs to be visited at most twice and thus
the time complexity of this step is $O(e)$, where $e$ is the number of edges in the task
precedence graph. Therefore, the time complexity of Algorithm 6.3.1 is $O(e)$. This
completes the proof of the theorem.

Once the gain graph is built, we can apply Algorithm 6.2.2 to determine the proper
grain  sizes.  Then,  the  set  of  grains  obtained  by  applying  Algorithm  6.2.2  can  be  used 
as  the  input  for  the  scheduling  stage. 

Compared  to  McCreary’s  approach  [54],  our  approach  has  the  following  advantages: 

First, our approach is more efficient in terms of time complexity. As shown
above, the time complexity of our approach is $O(\max(n \log n, e))$, in which $n \log n$ is for
Algorithm 6.2.2 and $e$ is for Algorithm 6.3.1. Here, $n$ and $e$ represent the number of
nodes and the number of edges in the task precedence graph, respectively. The time
complexity of McCreary’s approach is dominated by the parsing algorithm of $O(n^3)$.

Second, our approach is more general in the sense that any acyclic task graph can be
analyzed  for  grain  size  determination.  McCreary’s  method  can  handle  only  regular 
dependency  relations  among  the  nodes,  such  as  the  case  where  the  left  and  right  child 
nodes  have  symmetric  dependency  relations  to  their  parents. 

Third,  our  approach  considers  the  execution  time  of  the  nodes  and  the  communication 
time  between  the  nodes  as  primary  factors  from  the  beginning  of  the  grain  size  analysis 
and  selects  the  best  candidates  from  the  entire  task  precedence  graph  based  on  the 
analysis  of  such  information.  McCreary’s  method,  however,  may  lose  the  opportunity 
of  grouping  at  later  stages  because  the  decomposition  of  the  graph  is  done  without 
using  such  information. 

To  illustrate  our  approach  we  use  the  example  used  in  [54].  The  task  precedence 
graph  for  the  FFT  (Fast  Fourier  Transformation)  is  shown  in  Figure  6.7.  Using 
Algorithm  6.3.1,  we  obtain  the  gain  graph  shown  in  Figure  6.8.  We  also  show  a 
schedule  for  the  FFT  problem  in  the  case  of  four  processors  in  Figure  6.9. 


6.4  Pipelined  Parallelism 

In exploiting pipelined parallelism, one important consideration is how to divide a
program into a set of tasks to reduce the completion time of the program. We call
each task in the pipelined program a segment. In the following, we show that we
can find the optimal size of segments. Note that the optimal segment size corresponds
to the optimal grain size.

Figure 6.7: A task precedence graph for the Fast Fourier Transformation problem.


Definition 6.4.1 The dominant segment in a pipelined program is a task such that
its completion time is the largest among all the tasks in the program.


The  dominant  segment  dictates  the  completion  time  of  the  entire  pipelined  program 
because  the  processor  for  executing  the  dominant  segment  in  a  pipelined  program  is 
always  busy  once  all  the  segments  are  filled  with  data. 

Definition 6.4.2 All the tasks other than the dominant segment are called subordinate
segments.


Definition  6.4.3  The  computation  segment  in  a  pipelined  program  is  a  task  which 
receives  input  data,  executes  the  computation  associated  with  the  segment  and  returns 
the result.

Figure 6.8: A gain graph for the Fast Fourier Transformation problem.


Definition 6.4.4 The communication segment in a pipelined program is a task which
passes data from the preceding computation segment to the succeeding computation
segment.

Definition 6.4.5 The one-pass completion time $E(\psi_i)$, defined for a segment $\psi_i$ of a
pipelined program, is the amount of time required to complete processing of a datum
in $\psi_i$.

Note that the dominant segment can be either a computation segment or a communication
segment, but the neighbors of a communication segment are always computation
segments and vice versa. Computation segments and communication segments
have different properties. Computation segments can be segmented according to the
data dependency relations among the segments to reduce the one-pass completion
time. However, communication segments cannot be so refined. This distinction implies
that in order to find the optimal grain sizes, we need to begin the grain size
analysis from the smallest grain size available. In our approach, the IPR presents the
finest-grain parallelism obtainable at the program statement level.

Figure 6.9: A schedule for the FFT problem.


The  goal  of  finding  proper  grain  sizes  in  a  pipelined  program  is  to  reduce  the 
one-pass  completion  time  of  the  dominant  segment  by  either  combining  a  dominant 
segment  with  its  neighboring  segments  or  refining  a  dominant  segment.  However, 
since  the  IPR  presents  as  fine  grain  parallelism  as  possible,  the  refinement  of  the 
dominant segment would not be considered. In the following, we will present an optimal
grain size determination algorithm for pipelined parallelism, in the sense that
the completion time of a pipelined program is minimized over all allocations.

The  input  to  this  algorithm  is  the  execution  time  of  the  computation  segments  in 
a  pipelined  program  and  the  communication  time  of  the  communication  segments, 
and its output is an optimal grain size. We make the following assumptions on parallel
processing systems:

•  The  one-pass  completion  time  for  each  segment  is  fixed  and  known  a  priori. 

•  The amount of time for initialization of the segments in a pipelined program is
negligible. 

•  The  execution  of  code  and  communication  can  be  done  simultaneously. 

It is noted that these assumptions do not impose substantial restrictions on the
applicability of our approach for grain size determination in pipelined programs. The first
assumption is realistic in that we can easily obtain the one-pass completion time by
performing an analysis on the assembly language code of the pipelined program. In
a large pipelined program, the amount of time for the initialization of the segments
can be ignored because it is small compared
to the amount of time for which all the segments in the pipelined program are active.
Thus,  the  second  assumption  is  valid.  Due  to  the  existence  of  DMA  (Direct  Memory 
Access)  chips  in  parallel  processing  systems,  such  as  transputers,  the  execution  of  the 
code  and  communication  can  be  done  simultaneously.  Hence,  the  third  assumption  is 
also  valid. 

Algorithm 6.4.1 Determine Grains for Pipelined Parallelism

Step 1 Sort the segments using one-pass completion time as a key
       in the descending order.
       Let $\psi_i$, $i = 1, 2, \ldots, n$, be the segments in the program.

Step 2 Find a dominant segment $\psi_j$, $1 \le j \le n$.
       Let the preceding and the succeeding segments of $\psi_j$ be
       $\psi_{j-1}$ and $\psi_{j+1}$, respectively.

Step 3 If $\psi_j$ is a communication segment, then
           If $E(\psi_{j-1}) + E(\psi_{j+1}) < E(\psi_j)$, then
               Group $\psi_{j-1}$, $\psi_j$ and $\psi_{j+1}$ into a new computation segment $\psi_{j'}$
               so that $E(\psi_{j'}) = E(\psi_{j-1}) + E(\psi_{j+1})$.
               Delete $\psi_{j-1}$, $\psi_j$ and $\psi_{j+1}$ from the list.
               Go to Step 2.
       else
           Stop.
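
The loop of Algorithm 6.4.1 can be sketched in Python as below. The names `Segment` and `determine_pipeline_grains` are hypothetical, segments are assumed to alternate between computation and communication and to begin and end with a computation segment, and the sketch stops whenever the inner test of Step 3 fails. For brevity the dominant segment is found by a linear scan instead of the priority queue used in the complexity analysis, so this version runs in quadratic time in the worst case.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    kind: str            # "comp" or "comm"
    time: float          # one-pass completion time E
    members: List[int]   # indices of the original segments merged into this one


def determine_pipeline_grains(kinds_and_times):
    """kinds_and_times: list of (kind, one-pass completion time) in pipeline order."""
    segs = [Segment(kind, t, [i]) for i, (kind, t) in enumerate(kinds_and_times)]
    while segs:
        # Step 2: the dominant segment has the largest one-pass completion time.
        j = max(range(len(segs)), key=lambda p: segs[p].time)
        dom = segs[j]
        if dom.kind != "comm":           # Step 3: only communication segments are grouped
            break
        if j == 0 or j == len(segs) - 1:
            raise ValueError("a communication segment needs two computation neighbours")
        prev, nxt = segs[j - 1], segs[j + 1]
        merged = prev.time + nxt.time
        if merged >= dom.time:           # grouping would not shorten the dominant segment
            break
        # Replace the three segments by one new computation segment.
        segs[j - 1:j + 2] = [
            Segment("comp", merged, prev.members + dom.members + nxt.members)
        ]
    return segs
```

For a pipeline with one-pass times 2, 10, 3, 4, 5 (computation and communication alternating), the communication segment of time 10 is merged with its two neighbours into a computation segment of time 5.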


We will show in the following theorem that Algorithm 6.4.1 generates the optimal
grain sizes for pipelined parallelism.

Theorem 6.4.1 Algorithm 6.4.1 generates the optimal grain sizes.

Proof. When the dominant segment is a computation segment, the grains cannot be
further grouped, since we cannot divide the computation any further. Thus, the
dominant segment remains the same, and so does the completion time of the entire
program. Now consider the case in which the dominant segment is a communication
segment. In this case, the algorithm always tries to group the dominant segment with
its adjacent computation segments to reduce the one-pass completion time.

To prove this theorem, let a solution for a pipelined program obtained by Algorithm
6.4.1 be S1. Assume that there exists a solution for the pipelined program,
called S2, obtained by another algorithm, and that the completion time by S2 is smaller
than the completion time by S1. In the following we will show that S2 cannot exist.
Suppose that $\psi_j$ is the dominant segment such that

$$E(\psi_j) \ge E(\psi_k), \quad \text{for } 1 \le j \le n,\ 1 \le k \le n \text{ and } j \ne k. \tag{5}$$

Suppose that in S2 a communication segment $\psi_k$ is grouped with its adjacent segments
$\psi_{k+1}$ and $\psi_{k-1}$ before the dominant segment $\psi_j$ is grouped with its adjacent
segments $\psi_{j+1}$ and $\psi_{j-1}$. When $\psi_j$ and $\psi_k$ are not adjacent communication segments,
grouping of $\psi_k$ with its adjacent segments will not affect grouping $\psi_j$ with its adjacent
segments. Thus, we need to consider only the case in which $\psi_j$ and $\psi_k$ are two
adjacent communication segments.

Grouping of $\psi_k$ with its neighbors $\psi_{k+1}$ and $\psi_{k-1}$ implies that

$$E(\psi_{k-1}) + E(\psi_{k+1}) < E(\psi_k). \tag{6}$$

From (5) and (6), we have

$$E(\psi_{k-1}) + E(\psi_{k+1}) < E(\psi_j). \tag{7}$$

When $\psi_j$ and $\psi_k$ are two adjacent communication segments in S2, it may not be
possible to group $\psi_j$ with $\psi_{j-1}$ and $\psi_{j+1}$ ($= \psi_{k-1}$). If grouping of $\psi_j$ with
$\psi_{j-1}$ and $\psi_{j+1}$ is not possible, S2 may not be an optimal solution because such
grouping may be feasible in S1.

Now, we would like to show that there is a case where $\psi_j$ in S2 cannot be grouped
with $\psi_{j-1}$ and $\psi_{j+1}$, but $\psi_j$ in S1 can. Consider the case of

$$E(\psi_{j-1}) + E(\psi_{k-1}) < E(\psi_j) < E(\psi_{j-1}) + E(\psi_{k-1}) + E(\psi_{k+1}).$$

Since this case does not violate (7), it is considered a legal case. Then, from
$E(\psi_j) < E(\psi_{j-1}) + E(\psi_{k-1}) + E(\psi_{k+1})$, $\psi_j$ in
S2 will not be grouped with its two adjacent segments because such grouping would
increase the one-pass completion time of the segment. Consequently, the completion
time of the pipelined program in S2 is determined by $E(\psi_j)$. On the other hand,
from $E(\psi_{j-1}) + E(\psi_{k-1}) < E(\psi_j)$, $\psi_j$ in S1 will be grouped with its two adjacent
segments by Step 3 of Algorithm 6.4.1, and then the completion time of the pipelined
program is determined by $E(\psi_{j-1}) + E(\psi_{k-1})$, which is smaller than $E(\psi_j)$. Thus,
we have shown that there is a case in which S2 cannot reduce the completion time of the
pipelined program, but S1 can. Therefore, S2 cannot exist. Hence, we have shown
that Algorithm 6.4.1 always generates optimal grain sizes. This completes the
proof of the theorem.


Theorem 6.4.2 The time complexity of Algorithm 6.4.1 is $O(n \log n)$, where $n$ is the
number of segments of a pipelined program.

Proof. Step 1 requires $n \log n$ steps. Using a priority queue data structure to store the
sorted list, we can retrieve and store any element in $\log n$ steps. Thus, each execution
of Step 3 requires at most $4 \log n$ steps for three deletions and one addition to the
priority queue. Since there are at most $n$ segments, the entire algorithm can run in
$O(n \log n)$ time. This completes the proof of the theorem.

6.5  Modifying Intermediate Form


The  IPR  obtained  from  the  front-end  transformation  consists  of  only  a  set  of  nodes 
which  have  not  been  subject  to  grain  size  analysis.  Before  transforming  the  IPR  to 
the  target  code,  the  IPR  is  modified  to  reflect  the  information  obtained  from  the 
partitioning  and  the  grain  size  analysis.  This  process  is  called  Modifying  Intermediate 
Form  and  consists  of  the  following  two  steps: 

•  Modification  of  the  IPR  using  partitioning  information. 

•  Modification  by  performing  grain  size  analysis. 

The  modified  IPR  is  called  Modified  Intermediate  Form  (MIF)  which  is  a  hierarchical 
intermediate  form  that  is  architecture  dependent.  It  consists  of  the  following  3  levels 
of  representation: 

•  Primitive  node  level  (P-Graph):  This  represents  a  unit  of  code  corresponding 
to  the  body  of  a  control  thread  in  a  processor.  The  P-Graph  consists  of  a  set  of 
primitive  nodes  called  P-Nodes  and  a  set  of  edges  connecting  them.  P-Nodes  are 
the  smallest  computational  unit  in  the  intermediate  form,  such  as  +,  -,  *.  The 
edges  represent  data  dependency. 

•  Macro Graph level (M-Graph): This represents the multiple threads or process
level parallelism in a processor. The M-Graph consists of a set of nodes called
M-nodes or macro nodes and a set of edges connecting them. An edge between
two nodes represents the existence of communication between the two nodes.
Each macro node has a P-Graph corresponding to it.

•  File Graph level (F-Graph): This represents the processor level parallelism
in the application. The F-Graph consists of a set of nodes called file nodes or
F-nodes and a set of edges connecting them, where the edges represent the
communication between two file nodes. Each F-node has an M-Graph corresponding
to it.

To be executed on the parallel processing system, the parallel program must
be separated into a set of individual programs with acceptable grain size. Allocation
must then be performed on it based on the number of processors available for
the program. There are two types of parallelism: software parallelism and hardware
parallelism. Software parallelism is said to be achieved when multi-threading on a
processor is achieved. It is actually the hardware parallelism which gives a substantial
gain in performance. Software parallelism is achieved at the M-Graph level.
Furthermore, the M-Graph is subject to clustering and we obtain a set of file nodes
called F-nodes, where each file node consists of a set of M-nodes. This set of M-nodes
is called an M-Graph. The M-nodes in an M-Graph can be executed as parallel processes
on a processor.

The F-Graph is a graph consisting of a set of F-nodes which gives the number
of files to be generated. Each F-node is executed on a different processor. The
F-Graph is directly mapped onto the processors. The F-nodes are numbered 1 to
N where N is bounded by the number of processors. It can be seen that with the
modified  intermediate  form  we  can  partition  the  program  into  a  set  of  executable 
modules  where  each  module  can  be  executed  on  a  transputer.  The  macro  nodes  can 
become  the  different  threads  executing  concurrently  on  the  same  processor.  The  file 
identification  numbers  are  analogous  to  the  processor  identification  numbers. 

To illustrate the above feature, let us consider a simple example whose modified
intermediate form consists of two F-nodes, each corresponding to four M-nodes.
In this case we can generate two files to be run on two processors, each
having four concurrently executable threads. Let the two files be $F_0$ and $F_1$. The two
files $F_0$ and $F_1$ can be mapped onto two processors $T_0$ and $T_1$. Assume that there is
communication between $T_0$ and $T_1$ and that $T_0$ sends a 200-byte message to $T_1$. This
is equivalent to saying that there exists communication between $F_0$ and $F_1$. We can
also specify which M-node in $F_0$ is the source of the communication and which M-node
in $F_1$ is the destination of the message. In such a case we have all the information
required for communication between two processes, that is, the destination processor
and the particular process on that processor for which the message is intended. We
can thus get a one-to-one mapping of the MIF onto the various processes executing on the
processors, such as a network of transputers.
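
As an illustration of the three-level hierarchy and of the two-file example above, the structure can be modelled with a few Python data classes. All class, field and node names below are hypothetical, and the choice of the source and destination M-nodes for the 200-byte message is arbitrary, since the text does not fix them.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PGraph:
    """Primitive node level: P-Nodes such as '+', '-', '*' and data-dependency edges."""
    p_nodes: List[str] = field(default_factory=list)
    edges: List[Tuple[int, int]] = field(default_factory=list)  # indices into p_nodes


@dataclass
class MNode:
    """Macro node: one thread or process on a processor, backed by a P-Graph."""
    name: str
    body: PGraph = field(default_factory=PGraph)


@dataclass
class FNode:
    """File node: one generated file, executed on one processor."""
    file_id: int
    m_nodes: List[MNode] = field(default_factory=list)


@dataclass
class Message:
    """Communication edge of the F-Graph, resolved down to the M-node level."""
    src_file: int
    src_m_node: str
    dst_file: int
    dst_m_node: str
    size_bytes: int


def build_example():
    """Two F-nodes with four M-nodes each, and one 200-byte message from F0 to F1."""
    f0 = FNode(0, [MNode(f"m0_{i}") for i in range(4)])
    f1 = FNode(1, [MNode(f"m1_{i}") for i in range(4)])
    # Assumption: the endpoints m0_0 and m1_2 are picked only for illustration.
    msg = Message(src_file=0, src_m_node="m0_0", dst_file=1, dst_m_node="m1_2", size_bytes=200)
    return [f0, f1], [msg]
```

Each `Message` carries both the destination file (processor) and the destination M-node (process), which is the information the text identifies as sufficient for routing.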


6.5.1  Modifying  Intermediate  Form  Using  Partitioning  Information 

We will illustrate our approach to modifying the intermediate form with an example.
Consider Figure 6.10, which shows an example output of the partitioning stage.

Figure 6.11: A modified intermediate form graph based on partitioning.


Figure 6.11 shows six clusters, two of which have more than one object.
Let us assume that each of these clusters is executed on a single processor. This
results in more than one control thread in two of the processors and is represented
by more than one macro node corresponding to a single file node for those processors.
The P-Graph, which consists of primitive nodes, is executed serially. Hence, there is
no overhead due to communication while executing the P-Graph.


6.5.2  Modifying  Intermediate  Form  Using  Grain  Size  Analysis 

We will now illustrate our method by incorporating the information resulting from
grain size analysis in the IPR. Figure 6.3 shows a task precedence graph. We will
use our schedule shown in Figure 6.6 (a) for this task precedence graph to modify the
IPR. We see that tasks 1, 2, 9, 10, 13, 14, and 15 are executed on processor P1. These
tasks are executed sequentially to remove the overhead due to communication,
as shown in Figure 6.12. There is one macro node corresponding to each file
node, and each macro node has its corresponding P-Graph. The P-Graph
corresponding to the M-node of F-node F1 shows the precedence relation among the
tasks 1, 2, 9, 10, 13, 14, and 15. The P-Graph corresponding to the M-node of F-node
F2 shows the precedence between the tasks 3 and 4. Similarly, we obtain the graphs
corresponding to the other F-nodes shown in Figure 6.12. Grain size determination
is used to exploit the parallelism within a cluster of objects that results from
partitioning. Since F-nodes represent the parallelism among the tasks, we can allocate
a cluster over more than one F-node to represent the parallelism within a cluster.

Figure 6.12: A modified intermediate form after grain size determination.


Chapter 7


PROOF/L Back-end Translation


In  this  chapter,  we  will  present  the  translation  rules  for  translating  the  program  in  IPR 
to  Inmos  C  of  a  transputer  which  is  our  target  language.  We  also  identify  schemes  for 
detecting  iterations  and  predicates  in  the  IPR  that  can  be  translated  into  a  suitable 
control  statement  in  C.  Finally,  we  will  illustrate  our  translation  rules  with  examples. 


7.1  Translation  Rules 

7.1.1  Simple  Structure 

We have identified eleven types of nodes in the IPR: simple function, identity function,
constant  function,  copy  function,  macro  function,  latch,  selector,  distributor,  merge, 
construct,  and  split.  The  translation  rules  for  each  of  these  types  of  nodes  are  given 
as  follows: 

1. simple function

A simple function node represents a primitive function, including boolean
and logic operators, or a user-defined function. Simple functions
can be translated based on the number of input and output parameters required.
The number of input and output parameters can be easily determined by analyzing
the textual representation of the IPR. The translation rule can be expressed
as follows:

For a unary operator,

    o = unop i;

For a binary operator,

    o = i1 binop i2;

For a user-defined function f,

    f(i1, ..., in, o1, ..., om);


2. identity function

An identity node is translated so that the input is directly returned as the output.
The translation rule can be expressed as follows:

    o1 = i1; o2 = i2; ... om = im;


3. constant

A constant node is translated so that the same specific value is always produced.
If the value of a node is numeric, then this node is a constant node. The
translation rule can be expressed as follows:

    o1 = Const; o2 = Const; ... om = Const;


4. copy

A copy node is translated so that the appropriate number of copies having the
same value as the input are produced. That is, it transfers its input value onto
all of its output edges. The translation rule can be expressed as follows:

    o1 = i; o2 = i; ... om = i;


5. macro function

A macro function represents a compound function composed of simple and/or
macro functions. This function creates a process for parallel execution. The
translation rule can be expressed as follows:

    pi = f(Process *p, # of parameters, i1, ..., im, o1, ..., on);


6. latch

A latch node is for sequential operations. The input data is transferred onto the
output edge when the control is fired. The translation rule can be expressed as
follows:

    o = i;


7. selector

A selector node represents a conditional construction function formed by combining
other nodes. Conditional constructions achieve selective routing of data among inputs.
Boolean or index-valued data can be produced by a node that performs some
decision function. This node, with its related nodes, is translated so that one of the
inputs is returned as the output according to the value of the control data c. For m
inputs, the translation rule can be expressed as follows:

• if m = 1,

    if (c)
    {
        o = i1;
    }

• if m = 2,

    if (c)
    {
        o = i1;
    }
    else{
        o = i2;
    }

• Otherwise,

    switch (c)
    {
        case 1: o = i1; break;
        case 2: o = i2; break;
        ...
        case m: o = im; break;
    }


8. distributor

A distributor node is used in conditional constructs. It is translated so that the
input is passed to one of the output ports according to the value of the control data c.
For m outputs, the translation rule can be expressed as follows:

• if m = 1,

    if (c)
    {
        o1 = i;
    }

• if m = 2,

    if (c)
    {
        o1 = i;
    }
    else{
        o2 = i;
    }

• Otherwise,

    switch (c)
    {
        case 1: o1 = i; break;
        case 2: o2 = i; break;
        ...
        case m: om = i; break;
    }


9. merge

A merge node represents a nondeterministic selector, which receives an arbitrary
number of input data at a time and returns the one that arrives first. If
a merge node has more than one input available at the same time, one of the
inputs is chosen as the output arbitrarily or by priority. The translation rule can
be expressed as follows:

    o = one of i1, i2, ..., im;


10. construct

A construct node receives one or more input values and makes them into a list. The
translation rule is given as follows:

    o1 = i1; o2 = i2; ... om = im; o is list(o1, o2, ..., om).


11. split

A split node receives a list as an input and splits that list into values. The
translation rule can be expressed as follows:

    o1 = i1; o2 = i2; ... om = im; i is list(i1, i2, ..., im).
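
Since the translator ultimately writes C statements as text, a few of the rules above
can be sketched as small C++ functions that return the generated code. This is a
minimal sketch, not the translator's actual code: the function names emitCopy,
emitIdentity, and emitBinary are assumptions made for illustration, and variables are
passed simply as their names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Copy rule: every output variable receives the single input variable.
std::string emitCopy(const std::string& input,
                     const std::vector<std::string>& outputs) {
    std::string code;
    for (const std::string& o : outputs) {
        code += o + " = " + input + ";\n";
    }
    return code;
}

// Identity rule: output k receives input k.
std::string emitIdentity(const std::vector<std::string>& inputs,
                         const std::vector<std::string>& outputs) {
    assert(inputs.size() == outputs.size());  // one output per input
    std::string code;
    for (std::size_t k = 0; k < inputs.size(); ++k) {
        code += outputs[k] + " = " + inputs[k] + ";\n";
    }
    return code;
}

// Simple function rule for a binary operator, e.g. "v13 = v11 - v12;".
std::string emitBinary(const std::string& op, const std::string& lhs,
                       const std::string& rhs, const std::string& output) {
    return output + " = " + lhs + " " + op + " " + rhs + ";\n";
}
```

For instance, emitCopy("v1", {"v3", "v4"}) yields the two statements `v3 = v1;` and
`v4 = v1;`, which is the shape of code that appears in the generated procedures of
Section 7.3.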
Figure 7.1: The control structures for iteration and predicate statements.


7.1.2  Schemes  for  Detecting  Patterns 

High-level languages support statements such as iteration and predicate statements,
and it is therefore advantageous to detect patterns in the IPR that can be translated
into such high-level control structure statements and to reconstruct the
IPR accordingly. Iteration is expressed in the IPR through a cyclic data flow graph.
The body of the iteration is initially activated by a datum that arrives on the input
of the graph. The body produces a new datum, which is cycled back on a feedback path
until a certain condition is satisfied. A predicate in the IPR is realized by receiving
the value of a boolean expression, which is either true or false. This value determines
on which of the two outputs the data is available, either then_part or else_part. Both
structures are shown in Figure 7.1. The process of detecting these patterns starts when
the translator encounters a “copy” node and tries to detect the related nodes. Then,
iteration and predicate structures are reconstructed and translated as follows:


• Iteration

The iteration control structure can be translated as follows:

    while (b)
    {
        expression;
    }


• Predicate

The predicate control structure can be translated as follows:

    if (b)
    {
        then_expression;
    }
    else{
        else_expression;
    }


7.2  An  Algorithm  Traversing  the  Data  Dependency  Graph 


The translator starts translating the IPR to the target language at a node with no
predecessor. That is, the algorithm to traverse the graph proceeds by
translating a node in the graph that has no predecessor. Then, this node together with
all edges leading out from it is deleted from the graph. These two steps are repeated
until all the nodes are translated. An algorithm for traversing and translating each
node is shown as follows:

// Input: Intermediate Program Representation G = (V, E)
// Output: Target code executable on a transputer

1.  P ← ∅;
2.  while ||V|| > 0 do
    begin
3.      if there is no node v ∈ V with no predecessor then
        begin // parallel execution
4.          if P is empty then
            begin
5.              if ||V|| = 0 then return
6.              else error
            end
            else
            begin
7.              translate all v ∈ P for parallel execution;
8.              remove (v, w) from E
9.              remove v from V
10.             P ← ∅;
            end
        end
        else
        begin
11.         if v ∈ V is a macro function then add v to P
            else
            begin
12.             translate v;
13.             remove (v, w) from E
14.             remove v from V
            end
        end
    end


The above algorithm uses three sets: V, E, and P. V contains the vertices of
the given graph G. E contains the edges of the graph G. The algorithm works by
translating a vertex in G into the target code. P is used to collect the vertices for
parallel execution.

We can view the algorithm as a sequence of operations that manipulate the three
sets V, E, and P. Line 1 initializes the set P. Line 2, which controls the main loop of
this algorithm, requires maintaining a count of the number of vertices in the set V. In
line 3 we determine whether the translation for parallel execution is selected. Lines
7-10 represent the translation of all v's in P for parallel execution, after which P is
cleared. In line 11 we determine whether the node is selected for parallel execution
and, if so, put it into set P. Lines 12-14 represent the translation of simple function
or non-computational nodes.

Figure 7.2: The IPR of get and put methods.


In this implementation two stacks are used. One maintains the list of
nodes with zero predecessor count; the other collects the nodes for parallel execution.
The deletion of all edges leading out of a node can be carried out by decreasing the
predecessor count of all nodes adjacent to it. Whenever the count of a node drops to
zero, that node is placed onto the list of nodes with zero count, which is maintained
as a stack.
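
The traversal with predecessor counts and two stacks can be sketched in C++ as follows.
This is a minimal sketch under stated assumptions rather than the translator itself:
the type IprNode and the function traverse are hypothetical names, "translation" is
reduced to recording node names, and a macro node deferred into P is kept out of the
ready stack until P is flushed, a detail the pseudocode above leaves implicit.

```cpp
#include <cassert>
#include <cstddef>
#include <stack>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical node record: successors are indices into the node vector.
struct IprNode {
    std::string name;
    bool isMacro;
    std::vector<std::size_t> successors;
};

// Returns the translation order as groups of node names; a group with several
// names stands for macro nodes emitted together for parallel execution.
std::vector<std::vector<std::string>> traverse(const std::vector<IprNode>& g) {
    std::vector<int> predCount(g.size(), 0);
    for (const IprNode& n : g)
        for (std::size_t s : n.successors) {
            assert(s < g.size());
            ++predCount[s];
        }

    std::stack<std::size_t> ready;     // nodes whose predecessor count is zero
    std::stack<std::size_t> parallel;  // macro nodes collected in P
    for (std::size_t v = 0; v < g.size(); ++v)
        if (predCount[v] == 0) ready.push(v);

    std::vector<std::vector<std::string>> order;
    std::size_t translated = 0;
    // Deleting the out-edges of v means decrementing its successors' counts.
    auto release = [&](std::size_t v) {
        ++translated;
        for (std::size_t s : g[v].successors)
            if (--predCount[s] == 0) ready.push(s);
    };

    while (translated < g.size()) {
        if (!ready.empty()) {
            std::size_t v = ready.top();
            ready.pop();
            if (g[v].isMacro) {
                parallel.push(v);  // defer the macro node into P
            } else {
                order.push_back(std::vector<std::string>{g[v].name});
                release(v);
            }
        } else if (!parallel.empty()) {
            std::vector<std::size_t> flushed;  // flush P as one parallel group
            std::vector<std::string> group;
            while (!parallel.empty()) {
                flushed.push_back(parallel.top());
                group.push_back(g[parallel.top()].name);
                parallel.pop();
            }
            order.push_back(group);
            for (std::size_t v : flushed) release(v);
        } else {
            throw std::runtime_error("no ready node and P is empty");
        }
    }
    return order;
}
```

The final branch corresponds to the error case of the algorithm: vertices remain, none
of them is free of predecessors, and nothing is waiting in P.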


7.3  Examples 


We will illustrate the back-end translation rules with two examples: one is a Bounded
Buffer, which is communication-oriented, and the other computes the factorial,
which is computation-oriented.

In the Bounded Buffer program, the methods get and put are translated into
procedures get and put in the Inmos C code for a transputer.

The graphical representation of the IPR for the methods get and put is shown in
Figure 7.2. The get method in IPR is started by introducing an object “buf”. The
split node is now fireable and splits “buf” into two elements, “store” and “count”.
The data passes through the id node, and the copy node is used to duplicate the data
“store”. There are no edges connecting the tail, head, and dec nodes, which implies
that there is no data dependency between these nodes. That is, these nodes can be
executed in parallel. The results are constructed into a list by the construct nodes.
The put method follows similarly.




The following is the high-level description of the Inmos C code generated for the get
and put methods. The Inmos C code produced here is an unoptimized version.


get (Process *p, struct buffer *buf, struct buffer *out1, int *out2)
{
    declarations of process and local variables;

    v1 = buf->store;
    v2 = buf->count;
    v3 = v1;
    v6 = v2;
    v4 = v3;
    v5 = v3;
    parallel execution of
        tail(v4, v7);
        dec(v6, v8);
        head(v5, v10);
    v9->store = v7;
    v9->count = v8;
    out1 = v9;
    *out2 = v10;
}


put (Process *p, struct buffer *buf, int x, struct buffer *out1)
{
    declarations of process and local variables;

    v1 = buf->store;
    v2 = buf->count;
    v3 = v1;
    v4 = v2;
    parallel execution of
        append_right(x, v3, v5);
        inc(v4, v6);
    out1->store = v5;
    out1->count = v6;
}


For the factorial example, the IPR is shown in Figure 7.3. The factorial program is
started by introducing an integer “i”. The data “i” of type integer is duplicated and
given as inputs to the boolean expression part and the distributor node. The output
of the eq node is a boolean value that indicates whether the data “i” passes to
then_part or else_part. If true is produced by eq, the value “1” is passed through as
the result of the factorial method. If the eq node produces false, the data “i” is
passed through the copy node and the factorial function is recursively called with the
input data i - 1. At this point the data “i” and the result of the called factorial
function are multiplied and returned as the result of the factorial method.

Figure 7.3: The IPR for factorial.

The  following  is  the  high  level  description  of  the  Inmos  C  code  generated  for  the 
factorial  procedure. 


fac (Process *p, int i, int *out1)
{
    declarations of process and local variables;

    v3 = 0;
    v5 = 1;
    v12 = 1;
    v1 = i;
    v2 = i;
    v4 = v1;
    if (v3 == v4) {
        v6 = v2;
        v8 = v5;
        *out1 = v8;
    }
    else {
        v7 = v2;
        v10 = v7;
        v11 = v7;
        v13 = v11 - v12;
        fac(v13, v14);
        v16 = v10 * v14;
        *out1 = v16;
    }
}

Figure 7.4: The implementation scheme of a translator for PROOF/L to the target code.
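
For comparison, the control flow of the generated factorial procedure can be condensed
into a compilable C++ function. This is a sketch under stated assumptions, not output of
the translator: the Process pointer is omitted, the intermediate copies are folded away,
and the result type long is chosen here only to delay overflow.

```cpp
#include <cassert>

// Mirrors the generated procedure: the result is returned through out1.
void fac(int i, long *out1) {
    assert(i >= 0);        // the recursion terminates only for i >= 0
    const int zero = 0;    // constant node compared against i by the eq node
    if (zero == i) {
        *out1 = 1;         // then part: the constant 1 is the result
    } else {
        long sub = 0;
        fac(i - 1, &sub);  // else part: recursive call with i - 1
        *out1 = i * sub;   // multiply i by the result of the call
    }
}
```

Calling fac(5, &r) leaves 120 in r, following the same then and else branches as the
description above.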


7.4  Implementation  of  the  Back-end  Translator 

To translate the PROOF/L code to a target language, we developed a two-phase
translator. The front-end translator transforms the PROOF/L code to IPR and the
back-end translator transforms the IPR to the target language Inmos C. Figure 7.4
depicts the two-phase translation scheme of a PROOF/L program.

The back-end translator receives the IPR as the input and generates as the output
Inmos C code that can be executed on a transputer system. This phase of the
translator mainly consists of the following modules: reading the input from the
front-end translator, detecting the data dependencies in the data dependency graph,
visiting each node in the data dependency graph that has no predecessor nodes, and
translating such a node into the target code using a suitable translation rule. In the
following sections, the word “translator” refers to the back-end translator.


7.4.1  Reading  the  Input 

Before translating the IPR to a target language, the translator preloads the keywords
of Inmos C and the reserved variable names, which are used for local variables, output
parameter variables, channel variables, and process variables. The lexeme and token
representations for all the keywords are stored in the array keywords, whose
entries consist of a pointer to the lexemes array and an integer denoting the type
of keyword stored. The operation init_symbol inserts the keywords into a symbol
table and returns the symbol table index for the lexeme. Operations on reserved
variables are similar to those on keywords, but the operations are init_local, init_out,
init_channel, and init_process, respectively.

The  next  thing  that  should  be  done  in  the  translator  is  to  read  the  input  data 
generated  by  the  first  phase  of  the  translator.  The  input  is  represented  in  textual 
form  and  as  a  UNIX  file  which  consists  of  a  sequence  of  intermixed  data  types  - 
characters,  integers,  and  special  symbols.  It  consists  of  five  parts:  class  declaration, 
methods,  passive  object,  pseudo  active  object  and  active  object. 

In this implementation, the translator reads the IPR using the C++ stream input
and output operations. The iostream library in C++ predefines a set of operations
for reading and writing the built-in data types. Furthermore, file manipulation
using the input and output operations is also supported. To link streams to
files, both the header files iostream.h and fstream.h must be included. To open
a stream attached to a file for input, the translator uses the function open(). For
example,

#include <iostream.h>
#include <fstream.h>

ifstream ifile;
char lexbuf[MAXSTRING];

main(int argc, char *argv[])
{
    ifile.open(argv[argc-1], ios::in);
}


Then, to read the input from the initialized stream, the translator uses the operator >>:

ifile >> lexbuf;


When the translator is reading the input, it distinguishes the parts of the input stream
by the keywords Class, Method, Passive, PActive (Pseudo Active), and Active. It
collects the input streams and builds a data dependency graph for each object
and method. Every node in the IPR is represented by its name, which corresponds
to the operation performed by the node, the number of its predecessors, its set of
predecessors, and its set of successors. The nodes are conveniently organized as a
linked list. Consequently, an additional entry in the description of each node contains
the link to the next node in the list. Analogously, the sets of each node's predecessors
and successors are conveniently represented as linked lists. Each element of the
predecessor list and the successor list is described by an identification, a local
variable, and a link to the next element on the list. These data structures are shown in
Figure 7.5. If we call the data structure of the node list Node and the data structure
of the elements on the predecessor and successor chains InOut, we obtain the following
declarations of data types:

class Node {
    char *name;
    int Nofinput;
    class InOut *input;
    class InOut *output;
    class Node *next;
};

class InOut {
    char *id;
    char *LocalVar;
    char *LocalType;
    class InOut *next;
};


The translator transforms the input data into a linked list structure to build a
data dependency graph. This is performed by successively reading the input and
generating, for each node, the node structure and its input and output structures.
These structures are inserted in the list as a new entry.
At this time the translator assigns a local variable to each output structure so that it
can be used in the code generation stage.

When it encounters EOF, it finishes reading the input and goes to the next step.

while (ifile) { // it becomes false at end of file
    ifile >> lexbuf;
}

Figure 7.5: The data structure of a data dependency graph.

To  disconnect  a  stream  attached  to  a  file  from  the  program,  the  translator  invokes 
the  function  close().  For  example, 

ifile.close(); 
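
The reading loop can also be sketched with the standard C++ headers that later replaced
iostream.h and fstream.h. This is a minimal sketch under stated assumptions: the
function names readLexemes and countSections are hypothetical, the input is taken from
any std::istream so that a file stream or a string stream can be used, and the only
detail taken from the text is the list of five section keywords.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Reads whitespace-separated lexemes until extraction fails (end of file).
std::vector<std::string> readLexemes(std::istream& in) {
    std::vector<std::string> lexemes;
    std::string lexbuf;
    while (in >> lexbuf) {  // the extraction itself is the loop condition
        lexemes.push_back(lexbuf);
    }
    assert(in.fail());      // the loop ends only once extraction has failed
    return lexemes;
}

// Counts the lexemes that are one of the five section keywords of the IPR.
int countSections(const std::vector<std::string>& lexemes) {
    static const char* const keywords[] = {"Class", "Method", "Passive",
                                           "PActive", "Active"};
    int count = 0;
    for (const std::string& lexeme : lexemes)
        for (const char* keyword : keywords)
            if (lexeme == keyword) ++count;
    return count;
}
```

Testing the stream inside the loop condition means that a lexeme is stored only after
a successful read, so nothing is processed twice when the end of the file is reached.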


7.4.2  Internal  Representation  of  Data  Dependency 

In this stage the translator detects the data dependencies and establishes the
dependency between nodes by connecting the id field of each InOut structure to its
corresponding Node structure. It then assigns a variable type, integer or list, to
each local variable.

It visits all elements of the sets of predecessors and successors of a node. For each
element the translator finds the corresponding node and connects them. For
example, if an element of the set of successors of node 1 forms the input of node 2,
then that element is linked to node 2. Similarly, the
element of the set of predecessors of node 2 corresponding to the input from node 1
is linked to node 1. Figure 7.5 shows the data structure of a data dependency graph
and its connections. At that time the translator assigns the local variable to an
element of the set of predecessors of the node.
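
The linking step can be sketched in C++ as follows. This is a minimal sketch, not the
translator's code: it uses std::vector and integer indices in place of the linked lists
and pointers of the declarations above, the field peer and the function linkNodes are
hypothetical, and it assumes that every id is produced by exactly one output element.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// One element of a predecessor (input) or successor (output) list.
struct InOut {
    std::string id;        // identification shared by producer and consumer
    std::string localVar;  // local variable carrying the value
    int peer = -1;         // index of the connected node, -1 while unlinked
};

struct Node {
    std::string name;
    std::vector<InOut> input;
    std::vector<InOut> output;
};

// Links each input element to the node producing the same id, and the
// producing output element to the consumer. Returns the number of links.
int linkNodes(std::vector<Node>& nodes) {
    // Map each id to the (node, output slot) that produces it.
    std::unordered_map<std::string, std::pair<int, int>> producer;
    for (std::size_t n = 0; n < nodes.size(); ++n)
        for (std::size_t k = 0; k < nodes[n].output.size(); ++k) {
            const std::string& id = nodes[n].output[k].id;
            assert(producer.count(id) == 0);  // assumption: one producer per id
            producer[id] = std::make_pair(int(n), int(k));
        }

    int links = 0;
    for (std::size_t n = 0; n < nodes.size(); ++n) {
        for (InOut& in : nodes[n].input) {
            auto it = producer.find(in.id);
            if (it == producer.end()) continue;  // input from outside the graph
            InOut& out = nodes[it->second.first].output[it->second.second];
            out.peer = int(n);                   // successor link
            in.peer = it->second.first;          // predecessor link
            in.localVar = out.localVar;          // consumer reuses the variable
            ++links;
        }
    }
    return links;
}
```

After linking, the input element of the consumer carries the same local variable name
as the output element of the producer, which is what the code generation stage needs.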

After  establishing  the  links,  the  translator  identifies  specific  patterns  representing 
predicate  and  iteration  control  structures.  The  nodes  representing  the  predicate  are 
replaced  by  a  predicate  node.  The  nodes  representing  the  iteration  are  replaced  by 
an  iteration  node.  These  replacements  are  accomplished  following  the  schemes  for 
detecting  patterns  as  discussed  in  Section  7.1.2. 

Once  these  replacements  are  done,  the  translator  visits  every  node  in  the  modified 
graph  to  fix  the  type  of  data  flowing  along  each  edge  in  the  graph.  In  other  words, 
the  types  of  the  local  variables  are  fixed. 

7.4.3  Code  Generation  of  Method  and  Objects 

In this stage the translator generates the target code for the data structures
representing the classes, procedures representing methods, procedures representing
objects, and the main procedure. It also generates the channel mechanism for
communication between objects and the locking mechanism for synchronization.

The translator generates four kinds of files: class.h, method.c, outXX.c, and
main.c. The file class.h is a header file which contains all the data structures required
by the target code.

As mentioned before, the source language is PROOF/L and consists of objects
encapsulating data and methods. All methods are identified from the IPR and generated
in the file method.c. To generate the procedures representing methods from
the IPR, the first step done by the translator is to identify both the input and output
parameters of the procedure being generated. Once the parameters have been identified,
the translator produces the code for variable declaration. The remaining step
is to write the code for every method. This is achieved by following the translation
rules mentioned in Section 7.1.

The files outXX.c consist of procedures representing Passive, Pseudoactive,
and Active objects. Henceforth, we shall refer to such procedures as passive procedures,
pseudoactive procedures, and active procedures, respectively.

Passive  procedures  act  like  service  agencies.  They  wait  passively  until  one  of  their 
methods  is  invoked  by  other  procedures.  This  is  achieved  by  using  the  Inmos  C 
construct  ProcAlt().  For  example, 

int  signal; 

signal  =  ProcAlt(chan01,  chan02,  chan03,  NULL); 


ProcAlt suspends the current process until one of the channel arguments is ready
to input. On completion, the function returns an index into the parameter list
indicating the ready channel. In the above example, it sets signal to 0, 1 or 2
according to which of the three channels becomes ready first. In Inmos C, a channel
variable represents a unidirectional communication link between two processes.
Channels between processes are created simply by declaring a variable of type
Channel * at an appropriate point in the program. Channel input and output functions
are then used to pass data. These functions must be paired for two processes to
communicate and exchange data. Once the procedure sends data through a channel, it
waits for a reply. In other words, these procedures are blocked until they are
initiated by some other procedures.

For the sake of simplicity, we use a single-mode locking mechanism in the
implementation in order to ensure the consistency and correctness of objects. The
locking mechanism supported by PROOF/L is as follows:

while  (guard  is  false  or  lock  is  true)  ;  //  busy  waiting 
lock  =  true; 

sends  data  through  a  channel; 
receives  data  through  a  channel; 
unlock; 


The  guard  associated  with  the  method  is  evaluated.  If  the  guard  is  False,  repeat 
this  evaluation  until  the  guard  is  True.  That  is,  the  implementation  of  synchronization 
between  objects  uses  “busy  waiting”  to  achieve  mutual  exclusion.  Then,  it  sets  the 
lock  and  communicates  data  through  a  specified  channel.  Finally,  it  is  unlocked. 
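
The busy-waiting protocol can be sketched in standard C++ as follows. This is a minimal
sketch under stated assumptions and not the generated Inmos C code: the class
ObjectLock and its member names are hypothetical, the guard is passed as a callable,
and an atomic compare-and-exchange is substituted for the plain test followed by
lock = true, so that two callers cannot both observe a free lock.

```cpp
#include <atomic>
#include <cassert>
#include <functional>

// Single-mode lock of one object; the names are illustrative only.
class ObjectLock {
public:
    // One attempt of the busy-wait test: succeeds only if the guard holds
    // and the lock was free, in which case the lock is now held.
    bool tryEnter(const std::function<bool()>& guard) {
        if (!guard()) return false;
        bool expected = false;
        return locked_.compare_exchange_strong(expected, true);
    }

    // Busy waiting: repeat the attempt until it succeeds.
    void enter(const std::function<bool()>& guard) {
        while (!tryEnter(guard)) { /* spin */ }
    }

    // Releases the lock after the channel communication is finished.
    void leave() {
        assert(locked_.load());  // only a holder may unlock
        locked_.store(false);
    }

private:
    std::atomic<bool> locked_{false};
};
```

A caller would invoke enter with the guard of the method, exchange data over the
channel, and then call leave, matching the order of steps in the description above.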

Active procedures are active initially, and they may remain active throughout the
execution except for occasional suspensions for the purpose of synchronization with
other procedures. Each active procedure has its own body. The bodies of procedures
are functions that may be recursive and divergent (non-terminating). They do not wait
for an initiation by some other procedure to execute their body. These procedures can
have any type of parameters, with the restriction that the first parameter is a Process
pointer. Channel variables for the communication between this procedure and other
passive procedures are declared inside the procedure. An active procedure is capable
of spawning a number of child processes and executing them in parallel. The body
of an active object is translated by following the translation rules mentioned in the
previous section.

The third and final type of procedure is the pseudoactive procedure. A pseudoactive
procedure is a hybrid procedure consisting of an active part and a passive
part.

In the file main.c, the translator generates the code for allocating the channels
for inter-procedure communication. The code also allocates process pointers for the
different types of objects mentioned above. Once these are done, the main process
initiates the execution of these procedures in parallel.


7.4.4 Translation of Nodes


In this stage the translator generates the target code by translating each node in the
IPR as follows:

1. A binary function node, such as “mod”, is translated by the
simple function translation rule.

2. For a boolean function node (“EQ”, “or”, “and”, “not”, “True”, “False”), it is
checked whether it comes from a predicate or iteration node. If so, it is translated
into the condition of an “if” or a “while” statement. Otherwise, it is translated by
the simple function translation rule.

3. An identifier node, which represents a user-defined function, is translated
to a procedure call by the simple function translation rule.

4.  For  an  identity  node,  it  is  translated  in  two  different  ways  depending  on  whether 
it  is  in  the  object  or  the  method  graph.  If  the  identity  node  is  in  an  object  graph 
and  has  an  input  from  another  object,  it  is  translated  to  the  channel  mechanism 
to  receive  data  in  the  other  object.  After  generating  the  channel  mechanism,  it 
is  translated  by  the  identity  translation  rule.  Otherwise,  it  is  translated  directly 
by  the  identity  translation  rule. 

5.  For  a  constant  node,  it  is  translated  by  the  constant  node  translation  rule.  That 
is,  the  translator  generates  the  code  assigning  the  constant  value  into  the  output 
local  variable. 

6. A copy node is treated similarly to an identity node. It is checked whether or not
the input comes from another object. If so, the node receives the input from that
object; this is translated into the channel construct in Inmos C, and the node is then
translated by the copy node translation rule. Otherwise, translation is achieved
directly by the copy node translation rule.

7. For a construct node, the translator searches for the suitable class which collects
all the inputs into one class type, and the node is translated such that each input
element is assigned to the corresponding element of the composition in the class.

8.  For  a  split  node,  it  is  opposite  to  a  construct  node.  It  is  translated  in  such  a 
way  that  each  composition  element  of  the  input  object  is  assigned  to  each  local 
output  variable. 

9.  For  a  latch,  it  is  translated  in  such  a  way  that  the  second  input  data  is  assigned 
to  the  output  variable. 

10.  For  an  iteration  node,  it  is  translated  to  a  “while”  statement  in  Inmos  C. 

11. A predicate node is translated to a “one-armed-if”, “if-then-else”, or
“case” statement depending on the number of inputs.

12.  For  a  macro  node  which  represents  a  compound  function  composed  of  simple 
and/or  macro  functions,  it  is  translated  to  a  corresponding  process  in  Inmos  C. 
Chapter 8


An  Application  Example 


In this chapter, we give a hypothetical example to demonstrate our framework. The
example specifies the defense of air force bases deployed in a fictitious
scenario.


8.1  Specifications  for  the  Defense  of  Air  Force  Bases 

Assume that there are three air force bases that are closely connected. For the sake
of simplicity, we assume that there is only one type of fighter, one type of bomber, and one
type of surface-to-air missile battery for defense against the attacking
enemy. Radars and C3I (command, control, communication and intelligence) facilities
are available. We assume that all equipment in a class has the same effectiveness.
Effectiveness of the equipment is indicated on a scale of 1 to 100; the higher the
effectiveness number, the more effective the equipment. Each base may
have many radars, but has only one correlated radar value. Each base also has
several missile batteries and sufficient missiles for its defense. Each base
has either fighters only or bombers only, and hence a base is called either a fighter
base or a bomber base, respectively. In our example, we have assumed that the bases
are autonomous in their decision-making process, and thus the functionality of the
C3I has been embedded within the base. The distribution of the aircrafts in each base
is given in Table 8.1. Table 8.2 gives the effectiveness values for both the friendly and
the hostile equipment.

It  is  assumed  that  the  enemy  aircrafts  move  at  a  speed  of  around  600  miles/hour. 
The  radar  has  a  range  of  about  1000  miles  and  gives  the  composition  of  the  enemy 
cluster.  Each  enemy  cluster  is  composed  of  either  missiles  only  or  a  combination  of 
fighters  and  bombers.  It  is  assumed  that  at  a  given  time  the  enemy  sends  no  more 
than  two  clusters  to  attack  a  base.  Furthermore,  the  enemy  cluster  is  assumed  to 
target  a  particular  base  and  there  is  no  sudden  change  in  the  course  of  any  cluster  by 
the  enemy  to  attack  a  different  base. 
Table 8.1: Distribution of aircrafts in the base.

Base      Aircraft Type    Number of Aircrafts
Base 1    Bombers          20
Base 2    Fighters         40
Base 3    Fighters         50

Table 8.2: Effectiveness values for the friendly and hostile equipments.

Equipment    Friendly    Hostile
Bombers      60          60
Fighters     70          75
Missiles     85          80

If  the  enemy  cluster  consists  of  missiles  only,  then  the  bases  defend  themselves  by 
launching  their  own  missiles  to  intercept  the  incoming  missiles.  If  the  enemy  cluster 
is  composed  of  fighters  and  bombers,  then  the  base  defense  strategy  depends  on  the 
distance  at  which  the  enemy  is  detected.  If  the  enemy  is  detected  at  a  distance 
beyond  300  miles,  then  a  fighter  base  tries  to  defend  itself  by  launching  its  fighters 
and  a  bomber  base  defends  itself  by  requesting  help  from  neighboring  fighter  bases. 
If  the  bomber  base  does  not  get  any  help  from  its  neighboring  fighter  bases,  then  the 
bomber base will defend itself with its missiles. For the sake of simplicity, we have
assumed that the bases are located such that each of them has only one neighboring
fighter base. If the enemy is detected at a range between 50 and 300 miles, then the
bases will defend themselves by using their missiles. If the enemy bombers are closer
than 50 miles, the enemy bombers will release their bombs and it will be too late to
defend the base. In such a case, the aircrafts in the base will fly out of the base and
go to the aircraft shelters in a safer place, such as beyond the enemy’s range. We
assume that the minimum reaction time of the people protecting the base is about 40
seconds to get their aircrafts ready to flee. In addition, we assume that aircraft
can flee from the base at an average of one fighter every 5 seconds or one bomber
every 10 seconds.
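
The distance-based defense rules above can be sketched in Python as follows. This is only an illustration and not part of the PROOF/L design: the names `Cluster`, `choose_defense` and `evacuation_seconds` are hypothetical, the handling of the exact 50-mile and 300-mile boundaries is an assumption, and the evacuation time assumes that the aircraft leave one after another once the 40-second reaction time has elapsed.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """Radar value for one enemy cluster (hypothetical container)."""
    bombers: int
    fighters: int
    missiles: int
    distance_miles: float


def choose_defense(cluster: Cluster, is_fighter_base: bool) -> str:
    """Return the action a base takes for one detected enemy cluster."""
    # A cluster is either missiles only or a mix of fighters and bombers.
    if cluster.bombers == 0 and cluster.fighters == 0:
        return "launch_missiles"
    if cluster.distance_miles > 300:
        # Far away: fighter bases launch fighters, bomber bases ask for help.
        return "launch_fighters" if is_fighter_base else "request_help"
    if cluster.distance_miles >= 50:
        # Intermediate range: missile defense.
        return "launch_missiles"
    # Closer than 50 miles: too late to defend, fly to the shelter.
    return "evacuate"


def evacuation_seconds(fighters: int, bombers: int) -> int:
    """Reaction time plus serial departures (assumed to be additive)."""
    return 40 + 5 * fighters + 10 * bombers
```

Under these assumptions, `evacuation_seconds(0, 20)` gives 240 seconds for Base 1 (20 bombers), while an enemy moving at 600 miles/hour needs 300 seconds to cover 50 miles.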

For a base to defend itself, it should calculate the effectiveness of the enemy equipment
attacking it. The base then calculates the number of missiles or aircrafts needed
to match the effectiveness of the enemy cluster and then sends the required number of
aircrafts and missiles with at least the same total effectiveness. If the base is unable
to match the enemy cluster with its own equipment, the base then requests help from
its neighboring base. The neighboring base may or may not be able to help, depending
on how many aircrafts the neighboring base can spare. Only fighter bases can offer
help.
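
As a worked illustration of the effectiveness matching described above, the following Python sketch uses the values of Table 8.2. The text does not give the formula for the total effectiveness of a cluster, so the sum of (number of units x effectiveness value) used here is an assumption, and the function names are hypothetical.

```python
# Effectiveness values from Table 8.2.
FRIENDLY = {"bomber": 60, "fighter": 70, "missile": 85}
HOSTILE = {"bomber": 60, "fighter": 75, "missile": 80}


def enemy_effectiveness(bombers: int, fighters: int, missiles: int) -> int:
    """Total effectiveness of an enemy cluster (assumed to be additive)."""
    return (bombers * HOSTILE["bomber"]
            + fighters * HOSTILE["fighter"]
            + missiles * HOSTILE["missile"])


def units_needed(total: int, kind: str) -> int:
    """Smallest number of friendly units whose effectiveness is >= total."""
    per_unit = FRIENDLY[kind]
    return -(-total // per_unit)  # integer ceiling division


def fighters_to_request(total: int, own_fighters: int) -> int:
    """Fighters a base must ask its neighbor for; 0 if it can cope alone."""
    return max(0, units_needed(total, "fighter") - own_fighters)
```

For example, a cluster of 4 bombers and 6 fighters has an assumed total effectiveness of 4 x 60 + 6 x 75 = 690, which requires 10 friendly fighters (700) or 9 friendly missiles (765) to match.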
8.2 Object-Oriented Analysis


8.2.1 Identifying Classes and Objects

We  identify  the  following  Classes  from  the  requirement  specification  of  the  example: 

• Bomber_base - for bomber base.

• Fighter_base - for fighter base.

•  Radar  -  for  radar. 

• Shelter - a safe place to which the aircrafts can escape, such as beyond the range
of enemy aircrafts.

In  addition,  we  add  the  following  objects  to  the  example  to  keep  track  of  the  operations 
in  the  bases: 

•  Record  -  to  record  the  base  operations 

•  Reporter  -  prints  the  data  store  record. 

From the requirement specification we identify the following objects:

• b1 - corresponding to one bomber base.

• f2 and f3 - corresponding to the two fighter bases.

• r1, r2 and r3 - corresponding to the radars for the three bases.

•  shelter  -  corresponding  to  a  safe  place  that  the  friendly  aircrafts  fly  to  when  the 
enemy  aircrafts  get  too  close  to  the  base. 

•  record  -  for  recording  base  operations. 

•  reporter  -  for  printing  the  data  store. 


8.2.2  Defining  Class  Interfaces 

The class interface of an object consists of the input and output parameters and their
types. Shown below are the class interfaces of the various objects identified in the
previous subsection. As an example, let us consider the class Bomber_base. We show
one interface, the method put_rad_value. This method is invoked by the
class Radar. From the domain knowledge of the software we can infer that the radar
value consists of the following details:


• Number of bombers attacking the base.

• Number of fighters attacking the base.

• Number of missiles homing in on the base.

• The distance of the enemy cluster from the base.

Each of these values is an integer, as in the interface shown below. The class
interfaces for the other classes can be determined in a similar fashion. The classes
Radar and Reporter do not have any methods that are accessible to others, and hence
the class interfaces for Radar and Reporter do not exist. The complete class interface
for a fighter base is as follows:

class Fighter_base

    method put_rad_value (f:Fighter_base, bomb:int, fght:int,
                          miss:int, dist:int -> Fighter_base)
    #called by the radar to pass the value of the enemy cluster
    #to the base.

    method help (f:Fighter_base, n:int -> int)
    #invoked by a neighboring base when it needs fighters from
    #this base to defend itself.

    #to monitor the number of aircrafts on ground.
    method commit (f:Fighter_base, n:int -> Fighter_base)
    method uncommit (f:Fighter_base, n:int -> Fighter_base)

end class


8.2.3  Specifying  Dependency  and  Communication  Relationships  Among 
Objects 

Once the class interfaces are obtained for the various classes, we can establish the dependency
and communication relationships among the objects from the object-oriented
analysis phase. Figure 8.1 gives the dependency and communication relationships
among these objects. To illustrate the operation, let us consider the object r1. The
object r1 puts a radar value into the object b1 and after doing so it records the radar
values. Thus, there exists communication between b1 and r1 and between r1 and
record.
Figure 8.1: The object communication diagram for the set of decomposed objects of the example.

Table 8.3: Object Classification

Classification    Objects
Active            r1, r2, r3, reporter
Passive           record, shelter
Pseudo-active     b1, f2, f3

8.2.4 Identifying Active, Passive and Pseudo-Active Objects

From the requirement specification and from the object communication diagram
shown in Figure 8.1, we can see that the object r1 does not get invoked by other
objects, but invokes b1 and record. Thus, r1 can be identified as an active object. To
illustrate the method for identifying pseudo-active objects, let us consider the communication
behavior of the object b1, which is invoked by other objects and also invokes other objects.
Hence, b1 is identified as a pseudo-active object. If the communication behavior shows
an object being invoked only, then it is identified as a passive object. A typical
example is the object record. We can classify the objects as shown in Table 8.3.
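
The classification rule just described can be sketched in Python as follows. The function `classify` and the invocation map in the usage note are hypothetical; the map is an illustrative subset of the communication relationships, not a transcription of Figure 8.1.

```python
from typing import Dict, Set


def classify(invocations: Dict[str, Set[str]]) -> Dict[str, str]:
    """Classify objects from a map of caller -> set of invoked objects.

    active        : invokes other objects but is never invoked
    pseudo-active : is invoked and also invokes other objects
    passive       : is invoked only (or takes part in no invocation)
    """
    invoked = {callee for callees in invocations.values() for callee in callees}
    objects = set(invocations) | invoked
    result = {}
    for obj in sorted(objects):
        invokes = bool(invocations.get(obj))
        if invokes and obj not in invoked:
            result[obj] = "active"
        elif invokes:
            result[obj] = "pseudo-active"
        else:
            result[obj] = "passive"
    return result
```

For instance, with `{"r1": {"b1", "record"}, "b1": {"record", "shelter", "f2"}, "f2": {"record", "shelter"}}` (an assumed fragment of the example), `r1` is classified as active, `b1` and `f2` as pseudo-active, and `record` and `shelter` as passive, in agreement with Table 8.3.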


8.2.5  Identifying  Shared  Objects 

By  analysing  the  communication  diagram  as  well  as  the  object  behavior  for  all  the 
objects  in  this  example,  we  can  identify  the  objects  record  and  shelter  as  shared 
writable  objects.  Shared  objects  are  usually  passive  objects. 


8.2.6  Specifying  the  Behavior  of  Each  Object 

Using the notation of Chapter 3, we can describe the behavior of each object. For
instance, let us consider the object r1. As specified before, the object r1 puts a radar
value into the object b1 and then records the radar value in the object record. The
radar values, as mentioned before, are 1) the number of bombers attacking the base, 2)
the number of fighters attacking the base, 3) the number of incoming missiles heading towards
the base and 4) the distance of the hostile aircrafts or missiles from the base. These
values can be generated concurrently. After these values are generated, r1 records
these values in the object record, puts these values in the object b1 and modifies the
list of radar values it maintains. These operations can be done concurrently after the
values are generated. Thus, we have the behavior of the object r1. The behaviors of
the objects r1 and f2 are given below.


Behavior of object r1:

SEQ( CON(r1.generate_rad_bombervalue, r1.generate_rad_fightervalue,
         r1.generate_rad_missilevalue, r1.generate_rad_distancevalue),
     CON(b1.put_rad_value, record.base_data, r1.modify_list)
)
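
To make the intent of the SEQ and CON operators in the behavior of r1 concrete, the following Python sketch approximates them with ordinary calls and a thread pool. It is not the PROOF/L semantics or its translation: the helpers `seq` and `con`, the function `r1_step`, and the use of plain lists for the base and record objects are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def con(*tasks):
    """Run zero-argument callables concurrently; results keep argument order."""
    if not tasks:
        return []
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = [pool.submit(task) for task in tasks]
        return [future.result() for future in futures]


def seq(*tasks):
    """Run zero-argument callables one after another."""
    return [task() for task in tasks]


def r1_step(radar_lists, base_inbox, record_log):
    """One pass of r1: generate four values, then put, record and modify."""
    keys = ("bomber", "fighter", "missile", "distance")
    values = []

    def generate():
        # The four reads are independent, so they may run concurrently.
        values.extend(con(*[lambda k=k: radar_lists[k][0] for k in keys]))

    def distribute():
        # Each task touches a different object, so they do not interfere.
        def put():
            base_inbox.append(tuple(values))

        def record():
            record_log.append(tuple(values))

        def modify():
            for k in keys:
                radar_lists[k].append(radar_lists[k].pop(0))

        con(put, record, modify)

    seq(generate, distribute)
    return tuple(values)
```

The second CON group starts only after the first has finished, which mirrors the requirement that the radar values are distributed only after all four have been generated.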


Behavior of object f2:

CON(
    #passive part of the base which waits for the help request from a
    #neighboring base.
    ONE-OF(WAIT(f2.help, f1), WAIT(f2.commit, f1), WAIT(f2.uncommit, f1)),

    #active part of the base which works on defending its own base.
    SEQ(WAIT(f2.put_rad_value, r2), f2.compute_range,
        SEL(
            #enemy too close, so escape
            CON(shelter.escape, record.base_data, f2.commit),

            #enemy in intermediate range, so use missile defense.
            record.base_data,

            #enemy far away, so check for enemy cluster.
            SEQ(f2.enemy_cluster,
                SEL(
                    #if missile attack use missile defense.
                    record.base_data,

                    #aircraft attack
                    SEL(
                        #defend itself if possible
                        SEQ(f2.commit, record.base_data, f2.uncommit),

                        #ask for help from neighboring base
                        SEQ(f3.help,
                            SEL(
                                #if base can defend itself with
                                #its own aircrafts and with the
                                #neighbor's help
                                SEQ(CON(f2.commit, f3.commit),
                                    CON(record.base_data,
                                        record.base_data),
                                    CON(f3.uncommit, f2.uncommit)),

                                #if no help available and base
                                #cannot defend itself with
                                #aircrafts, then use missile
                                #defense.
                                SEQ(f3.commit,
                                    CON(record.base_data,
                                        record.base_data),
                                    f3.uncommit)))))))))
Table 8.4: Object classification for the new set of objects

Classification    Objects
Active            r1, r2, r3, reporter
Passive           record1, record2, record3, s1, s2, s3
Pseudo-active     b1, f2, f3

8.2.7  Identifying  Bottleneck  Objects 

We have already identified the objects record and shelter as shared writable objects.
Since each of these objects is accessed by the three base objects b1, f2 and f3, the access
to the objects record and shelter has to be serialized. Hence, record and shelter are
bottleneck objects. We can split each of them into three objects: record1, record2,
record3 and s1, s2, s3. In addition, record1 and s1 are associated with b1, record2 and
s2 are associated with f2, and record3 and s3 are associated with f3. Thus, we have
reduced the number of bottleneck objects in the system and enhanced the parallelism
in the program.

Since each of the objects created in this step is only a copy of an existing object,
the class interface does not have to be modified. The object communication diagram
changes to reflect the new objects and is shown in Figure 8.2. The new active,
passive and pseudo-active objects are given in Table 8.4. The shared objects are now
only record1, record2 and record3, which are shared by the corresponding radar and
base objects. This sharing is acceptable and will not create a bottleneck since the
number of times a radar accesses the record object is small compared to the number
of times the base object accesses the record object. The new behaviors of the objects r1 and
f2 are shown below.


Behavior of object r1:


SEQ( CON(r1.generate_rad_bombervalue, r1.generate_rad_fightervalue,
         r1.generate_rad_missilevalue, r1.generate_rad_distancevalue),
     CON(b1.put_rad_value, record1.base_data, r1.modify_list)
)


Behavior of object f2:

CON(
    #passive part of the base which waits for the help request
    #from a neighboring base.
    ONE-OF(WAIT(f2.help, f1), WAIT(f2.commit, f1), WAIT(f2.uncommit, f1)),

    #active part of the base which works on defending its own base.
    SEQ(WAIT(f2.put_rad_value, r2), f2.compute_range,
        SEL(
            #enemy too close, so escape
            CON(s2.escape, record2.base_data, f2.commit),

            #enemy in intermediate range, so use missile defense.
            record2.base_data,

            #enemy far away, so check for enemy cluster.
            SEQ(f2.enemy_cluster,
                SEL(
                    #if missile attack use missile defense.
                    record2.base_data,

                    #aircraft attack
                    SEL(
                        #defend itself if possible
                        SEQ(f2.commit, record2.base_data,
                            f2.uncommit),

                        #ask for help from neighboring base
                        SEQ(f3.help,
                            SEL(
                                #if the base can defend itself
                                #with its own aircrafts and with
                                #the neighbor's help
                                SEQ(CON(f2.commit, f3.commit),
                                    CON(record2.base_data,
                                        record3.base_data),
                                    CON(f3.uncommit, f2.uncommit)),

                                #if no help available and base cannot
                                #defend itself with aircrafts, use
                                #missile defense.
                                SEQ(f3.commit,
                                    CON(record2.base_data,
                                        record3.base_data),
                                    f3.uncommit)))))))))


Figure 8.2: The object communication diagram for the modified set of objects.


8.2.8  Checking  for  Completeness  and  Consistency  of  the  Object-Oriented 
Analysis 

By  tracing  through  the  behavior  of  the  objects  and  also  looking  at  the  class  interfaces 
we  can  easily  see  that  the  above  object-oriented  analysis  is  complete  and  consistent. 
8.3 Object Design

8.3.1  Establishing  Hierarchy 

In this example, we have identified the fighter base and the bomber base as two
different types of objects with some common behavior. Thus, we can define a superclass
called Base which has the information common to both the fighter base and
the bomber base. We can thus make Fighter_base and Bomber_base derived
classes of Base, inheriting all the information of the class Base in addition to their
own special features. The remaining classes do not form a class hierarchy. Once the
hierarchy is identified, the methods of each class are listed. The listing shown below
describes the class hierarchy, the various methods in a particular class and their functions.


class Base

    method put_rad_value (f:Base, bomb:int, fght:int,
                          miss:int, dist:int -> Base)
    #called by the radar to pass the value of the enemy cluster
    #to the base.

    method compute_range (f:Base -> int)
    #returns a value proportional to the range; such functions
    #are used for the sake of functional programming style.

    method enemy_cluster (f:Base -> int)
    #determines if it is a missile or aircraft attack.

    method effective (f:Base -> int)
    #effectiveness of the enemy cluster

    #to keep track of the aircrafts currently on ground.
    method commit (f:Base, n:int -> Base)
    method uncommit (f:Base, n:int -> Base)

end class

class Fighter_base : Base

    #the following methods return the number of aircrafts saved or
    #destroyed while escaping.
    method saved_values (f:Fighter_base -> int)
    method destroyed_values (f:Fighter_base -> int)

    method help (f:Fighter_base, n:int -> int)
    #invoked by a neighboring base when it needs fighters from
    #this base to defend itself.

end class

class Bomber_base : Base

    #the following methods return the number of aircrafts saved or
    #destroyed while escaping.
    method saved_values (b:Bomber_base -> int)
    method destroyed_values (b:Bomber_base -> int)

end class


8.3.2  Designing  Class  Composition  and  Methods 

The class composition typically consists of the local data present in the class. The
type of the data present in the class is also identified. In this stage we also provide the
methods present in each of the classes. As an example, consider the class composition
of the class Radar. The data in the object are lists of predefined values which are
integers for bomber values, fighter values, missile values and the distance of the enemy
aircrafts and missiles. These constitute the class composition. In addition to these,
we define the methods. The methods required for class Radar are 1) to generate the
bomber value, 2) to generate the fighter value, 3) to generate the missile value, and
4) to generate the distance of the enemy aircrafts and missiles. These are defined
formally below.
class Radar

    composition
        bomber:list(int) X        #predefined list of values
        fighter:list(int) X       #for bombers, fighters, missiles
        missile:list(int) X       #and the distance the enemy is
        distance:list(int)        #currently detected

    # The following four methods are used to read the values
    # of the enemy cluster and distance from the predefined
    # list of values.

    method generate_rad_bombervalue(r:Radar -> int)
    expression
        head r.bomber

    method generate_rad_fightervalue(r:Radar -> int)
    expression
        head r.fighter

    method generate_rad_missilevalue(r:Radar -> int)
    expression
        head r.missile

    method generate_rad_distancevalue(r:Radar -> int)
    expression
        head r.distance

    # The following method is used to move the value read in
    # by the above methods from the head of the list to the
    # tail of the list.

    method modify_list(r:Radar, bomb:int, figt:int, miss:int,
                       dist:int -> Radar)
    expression
        delete the values from the head of their
        corresponding list and append them to the
        tail of that list.

end class
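
For readers more familiar with a mainstream language, the Radar class above can be approximated in Python as shown below. This is a sketch under stated assumptions rather than the output of the PROOF/L translator: the immutable dataclass stands in for the functional style (each method returns a value or a new Radar instead of updating in place), and tuples stand in for list(int).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Radar:
    """Composition of the Radar class: four predefined lists of values."""
    bomber: Tuple[int, ...]
    fighter: Tuple[int, ...]
    missile: Tuple[int, ...]
    distance: Tuple[int, ...]


def generate_rad_bombervalue(r: Radar) -> int:
    return r.bomber[0]  # head r.bomber


def generate_rad_fightervalue(r: Radar) -> int:
    return r.fighter[0]  # head r.fighter


def generate_rad_missilevalue(r: Radar) -> int:
    return r.missile[0]  # head r.missile


def generate_rad_distancevalue(r: Radar) -> int:
    return r.distance[0]  # head r.distance


def modify_list(r: Radar, bomb: int, figt: int, miss: int, dist: int) -> Radar:
    """Drop the head of each list and append the value just read to its tail."""
    return Radar(
        bomber=r.bomber[1:] + (bomb,),
        fighter=r.fighter[1:] + (figt,),
        missile=r.missile[1:] + (miss,),
        distance=r.distance[1:] + (dist,),
    )
```

Because `modify_list` returns a new `Radar`, the caller decides when the old value is replaced, which keeps the methods free of side effects.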
8.3.3 Designing the Body of the Objects


The body of an object describes the control thread within the body. A control thread
exists only for active and pseudo-active objects. Thus, the body exists only for active
and pseudo-active objects. In our application, the body thus exists for the objects
r1, r2, r3, b1, f2, f3 and reporter since these objects have been identified previously
as active or pseudo-active objects. The behavior of an active object should describe
the body of that object. For example, the object r1 has a body which iteratively
executes in accordance with its behavior specified before and this is shown below.


Body of object r1:


SEQ( #Assign values to the radar object
     R[| r1 |] object radar( pre_assign radar values),

     while(True,

       #Generate the radar values to be passed on to the base.
       CON(r1.generate_rad_bombervalue, r1.generate_rad_fightervalue,
           r1.generate_rad_missilevalue, r1.generate_rad_distancevalue),

       #Put the values obtained above in the base and record, and then
       #modify the radar values. This operation will modify all the
       #objects involved.
       CON( R[| b1 |] b1.put_rad_value,
            R[| record1 |] record1.base_data,
            R[| r1 |] r1.modify_list)
     )
)


8.4  Verification 


In the first step, the bodies of the active and pseudo-active objects are transformed
into Petri nets. The transformations of the bodies of the objects in this application
are shown in Figures 8.3 - 8.10.

The second step is to compose these nets to reduce the number of independent Petri
nets. The nets are composed at the fusion point, also called the bottleneck place, so
that shared modifiable objects are serialized for access among the different objects.
For example, the object record1 is a shared writable object that is modified by the
objects r1 and b1. Thus all the transitions in Figures 8.3 - 8.10 corresponding to the
methods in record1 are to be fused together at the bottleneck place. This process of
fusing will bring the different nets together.

The last step is to refine the above nets to reflect the details of each method. This
is achieved by expanding each transition to show the guard and the expression. This
has been illustrated earlier in our framework. Once the Petri net is obtained, we can
then apply the available techniques to verify that the Petri net satisfies the necessary
properties.

Figure 8.3: Transformation of r1 to Petri net


8.5  Partitioning 


The software system is decomposed into the following set of objects: Radar r1, Radar
r2, Radar r3, Shelter s1, Shelter s2, Shelter s3, Report reporter, Record record1,
Record record2, Record record3, Bomber base b1, Fighter base f2, Fighter base f3.
These objects are numbered from 1 to 13, respectively.




Legend for transitions:

1. r2.generate_rad_bombervalue
2. r2.generate_rad_fightervalue
3. r2.generate_rad_missilevalue
4. r2.generate_rad_distancevalue
5. f2.put_rad_value
6. record2.base_data
7. r2.modify_list


Figure 8.4: Transformation of r2 to Petri net
Legend for transitions:

1. r3.generate_rad_bombervalue
2. r3.generate_rad_fightervalue
3. r3.generate_rad_missilevalue
4. r3.generate_rad_distancevalue
5. f3.put_rad_value
6. record3.base_data
7. r3.modify_list


Figure 8.5: Transformation of r3 to Petri net
Legend for transitions:

1. record1.record_info_val
2. record1.record_counter_val
3. reporter.record_value
4. record1.record_init
5. record2.record_info_val
6. record2.record_counter_val
7. reporter.record_value
8. record2.record_init
9. record3.record_info_val
10. record3.record_counter_val
11. reporter.record_value
12. record3.record_init
13. s1.escape_info_val
14. s1.escape_counter_val
15. reporter.escape_value
16. s1.escape_init
17. s2.escape_info_val
18. s2.escape_counter_val
19. reporter.escape_val
20. s2.escape_init
21. s3.escape_info_val
22. s3.escape_counter_val
23. reporter.escape_val
24. s3.escape_init
25. reporter.print_report
26. reporter.report_init


Figure 8.6: Transformation of reporter to Petri net




Legend for transitions:

1. f3.help
2. f3.commit
3. record2.base_data
4. record3.base_data
5. f3.uncommit
6. f2.commit
7. f3.uncommit
8. record2.base_data
9. record3.base_data
10. f2.uncommit
11. f3.uncommit


Figure 8.9: Transformation of f2 to Petri net (cont.)

Figure 8.10: Transformation of f3 to Petri net
The input to the partitioning includes the following three parts:


•  Object  behavior  of  the  active  and  pseudo-active  objects. 

•  The  frequency  of  object  invocations  and  the  number  of  data  units  transferred 
between  two  objects  every  time  one  invokes  the  other. 

•  Number  of  replicated  objects. 

1) The object behavior is given as follows:

1 : SEQ( 1, CON( 11, 8, 1 ))

2 : SEQ( 2, CON( 12, 9, 2 ))

3 : SEQ( 3, CON( 13, 10, 3 ))

7 : SEQ( 8, 9, 10, 4, 5, 6 )


11 : SEQ( WAIT( 1 ),
          11,
          SEL( CON( 4, 8 ),
               8,
               SEQ( 11,
                    SEL( 8,
                         SEQ( 12,
                              SEL( SEQ( CON( 12, 8, 9 ),
                                        11,
                                        CON( 12, 8, 9 ) ),
                                   SEQ( CON( 12, 8, 9 ),
                                        11,
                                        SEQ( 12, 8 ))))))))


12 : CON( SEQ( WAIT( 2 ),
               12,
               SEL( CON( 5, 9, 12 ),
                    9,
                    SEQ( 12,
                         SEL( 9,
                              SEL( SEQ( CON( 9, 12 ),
                                        12,
                                        12 ),
                                   SEQ( 13,
                                        SEL( SEQ( CON( 12, 13, 9, 10 ),
                                                  12,
                                                  CON( 13, 12, 9, 10 )),
                                             SEQ( CON( 13, 9, 10 ),
                                                  12,
                                                  CON( 13, 9 ))))))))),
          SEQ( WAIT( 11 ), WAIT( 11 ), WAIT( 11 )))


13 : CON( SEQ( WAIT( 3 ),
               13,
               SEL( CON( 6, 10, 13 ),
                    10,
                    SEQ( 13,
                         SEL( 10,
                              SEL( SEQ( CON( 10, 13 ),
                                        13,
                                        13 ),
                                   10))))),
          SEQ( WAIT( 12 ), WAIT( 12 ), WAIT( 12 ))
     )


Table 8.5: Communication and Concurrency Weights for the Initial Graph.

-     (1)  (2)  (3)  (4)  (5)  (6)  (7)  (8)  (9)  (10)  (11)  (12)  (13)
(1)   -  -  -  -  -  -  (1,0)  -  -  (1,0)  -  -
(2)   -  -  -  -  -  -  (1,0)  -  -  (1,0)  -
(3)   -  -  -  -  -  -  (1,0)  -  -  (1,0)
(4)   -  -  (1,0)  (0,1/3)  -  -  -  -
(5)   -  (1,0)  -  (0,1/3)  -  (1/3,1/3)  -
(6)   (1,0)  -  -  (0,1/3)  -  (0,1/3)  (1/3,1/3)
(7)   (1,0)  (1,0)  -  -  -  -
(8)   (0,1/4)  -  (7/6,0)  (0,1/3)  -
(9)   (0,1/4)  (1/4,7/12)  (13/12,7/4)  (0,1/16)
(10)  (0,1/)  (1/.13/12)  (1,37/24)
(11)  (1/2,25/12)  (0,1/6)  -
(12)  (1/4,2)


2)  When  two  objects  communicate  (i.e.  one  object  invokes  the  other),  the  frequency 
of  object  invocation  is  assumed  to  be  10^  and  the  number  of  data  units  transferred 
between  the  two  objects  every  time  is  assumed  to  be  10^.  Let  a  =  10®  and  b  =  -10^ 
in  the  following  discussion. 

3) It is given that no replicated objects are included in the software system. As mentioned
earlier, our partitioning approach can handle replicated objects used for satisfying
fault tolerance requirements.

Figure  8.11  illustrates  the  initial  graph  and  Table  8.5  contains  the  communication 
and  concurrency  weights  for  the  initial  graph. 

Table  8.6  contains  the  edge  weights  at  the  end  of  the  normalization  stage. 

Figure 8.11: Initial Graph.

In Table 8.6, the edge incident to nodes (12) and (13) has a minimum weight with
a negative value. Hence, the two nodes should be clustered. The resulting weights
are tabulated in Table 8.7.

In Table 8.7, the edge incident to nodes (11) and (12, 13) has a minimum weight
with a negative value. Hence, the two nodes should be clustered. The resulting
weights are tabulated in Table 8.8.

In Table 8.8, the edge incident to nodes (10) and (11, 12, 13) has a minimum weight
with a negative value. Hence, the two nodes should be clustered. The resulting weights
are tabulated in Table 8.9.

In Table 8.9, the edge incident to nodes (9) and (10, 11, 12, 13) has a minimum
weight with a negative value. Hence, the two nodes should be clustered. The resulting
weights are tabulated in Table 8.10.

In Table 8.10, the edge incident to nodes (6) and (9, 10, 11, 12, 13) has a
minimum weight with a negative value. Hence, the two nodes should be clustered.
The resulting weights are tabulated in Table 8.11.

In Table 8.11, the edge incident to nodes (5) and (6, 9, 10, 11, 12, 13) has a
minimum weight with a negative value. Hence, the two nodes should be clustered.
Table 8.6: Weights after Normalization.


(1)  I  (2)1(3)  (4)  1(5)1  (6)  I  (7)  I  (8) 


0.86 


(9)  (10)  (11)  (12)  (13) 


0.86 

0.86 

-0.48  0.86 


0.86  -0.16 

0.86  -  -0.16 

0.86 

0.86  0.86 
-0.12 


0.29 

-0.16  0.13 


-0.16  -  -0.16  0.13 

1  -0.16 

-0.12  -0.07  0.09  -0.08 

-0.06  -0.41  0.26 

-0.57  -0.08 


Table  8.7:  Gain  Weights  after  the  First  Clustering. 


(4)  I  (5)  1  (6)  I  (7) 


(10)  I  (11)  (12,  13) 


0.86 


0.86  -0.16 
0.86 


-0.16 


0.86  0.86 
-0.12 


0.29 

-0.16 


-0.12  -0.07 
-0.06 
-0.65 


0.13 

-0.038 

-0.16 

0.01 

-0.15 


Table  8.8:  Gain  Weights  after  the  Second  Clustering. 


(7) 

(8) 

(9) 

(10) 

0.86 

- 

- 

- 

- 

0.86 

- 

- 

- 

- 

0.86 

0.86 

-0.16 

- 

0.86 

- 

-0.16 

- 

0.86 

- 

- 

-0.16 

0.86 

0.86 

-0.12 

-0.12 

1 

0.84 

-0.06 

-0.21 
Table  8.9:  Gain  Weights  after  the  Third  Clustering. 


(7) 

(8) 

(9) 

1  (10,  11,  12 

- 

0.86 

0.86 

- 

- 

- 

1.24 

0.86 

-0.16 

- 

0.29 

0.86 

- 

-0.16 

-0.03 

0.86 

- 

- 

-0.19 

0.86 

0.86 

- 

-0.12 

0.84 

-0.18 

Table  8.10:  Gain  Weights  after  the  Fourth  Clustering. 


(1)  (2)  (3)  (4)  (5) 


(7) 

(8) 

(9,  10,  11,  12,  13) 

- 

0.86 

0.86 

1.72 

- 

- 

1.24 

-0.16 

0.29 

- 

-0.19 

0.86 

- 

-0.19 

0.86 

0.86 

0.72 

Table  8.11:  Weights  after  the  Fifth  Clustering. 


(7) 

(8) 

(6,  9,  10,  11,  12,  13) 

- 

0.6 

0.6 

- 

- 

1.72 

- 

- 

1.24 

0.6 

-0.16 

0.29 

0.6 

- 

-0.19 

0.6 

1.72 
Table  8.12:  Weights  after  the  Sixth  Clustering. 


Table  8.13:  Weights  in  Output  Graph. 
The  resulting  weights  are  tabulated  in  Table  8.12. 

In  Table  8.12,  the  edge  incident  to  nodes  (4)  and  (8)  has  a  minimum  weight  with 
a  negative  value.  Hence,  the  two  nodes  should  be  clustered.  The  resulting  weights 
are  tabulated  in  Table  8.13. 

In  Table  8.13,  no  edge  has  a  negative  weight.  Thus,  Table  8.13  represents  the 
weights  in  the  output  graph.  The  output  graph  is  shown  in  Figure  8.12. 
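
The walk-through above follows a single loop: pick the minimum-weight edge, cluster its two endpoints if that weight is negative, recompute the weights, and stop once no edge is negative. A minimal Python sketch of that loop is given below. The function name, the data layout, and in particular the rule used to combine parallel edges after a merge (plain addition by default) are assumptions made for illustration; the tables above rely on the report's own gain-weight computation, which this sketch does not reproduce.

```python
from typing import Callable, Dict, FrozenSet, Set, Tuple

Node = Tuple[int, ...]      # a cluster, e.g. (12, 13); a single node is (12,)
Edge = FrozenSet[Node]      # an undirected edge between two distinct clusters


def cluster_negative_edges(
    nodes: Set[Node],
    weights: Dict[Edge, float],
    combine: Callable[[float, float], float] = lambda a, b: a + b,
) -> Tuple[Set[Node], Dict[Edge, float]]:
    """Merge the endpoints of the minimum-weight edge while it is negative.

    `combine` is a hypothetical stand-in for the weight-update rule: it is
    applied when two edges collapse into one after a merge.
    """
    nodes = set(nodes)
    weights = dict(weights)
    while weights:
        edge, w = min(weights.items(), key=lambda kv: kv[1])
        if w >= 0:
            break                         # no negative edge left: output graph
        u, v = sorted(edge)
        merged = tuple(sorted(u + v))     # new cluster containing both nodes
        nodes -= {u, v}
        nodes.add(merged)
        updated: Dict[Edge, float] = {}
        for e, x in weights.items():
            if e == edge:
                continue                  # the clustered edge disappears
            rest = e - edge
            if len(rest) == 2:
                updated[e] = x            # edge does not touch the new cluster
            else:
                (other,) = rest
                key = frozenset({merged, other})
                updated[key] = combine(updated[key], x) if key in updated else x
        weights = updated
    return nodes, weights
```

Calling the function with a set of single-node clusters and a dictionary of edge weights returns the clusters and edge weights of the output graph; passing a different `combine` function changes only how weights are recomputed, not the stopping rule.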


8.6  Implementation 

8.6.1  Coding 

Based on the design of the objects, we can write the code in PROOF/L. After the first level of translation, we can perform the grain size analysis on the resulting intermediate form.


8.6.2  Grain  Size  Analysis 

Figure 8.12: Output Graph.

Since we are using the router in the application program, the communication time is a major factor in determining the grain size. From the execution times of the various primitive nodes discussed before, we find that it is not feasible to exploit fine grain parallelism on the transputer. This is because the communication time from one transputer to another is around 26 msec for a single byte of message, while the execution times of the primitive nodes in the intermediate form are far less. Thus, we have exploited parallelism at the object level to improve the performance of the overall system. The intermediate form for the application was modified such that it generated six files. In other words, there were six F-nodes in the modified intermediate form. Thus, the application was made to execute on six transputers.
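
The grain-size argument above can be expressed as a simple comparison between the work a grain performs and the cost of the messages needed to place it on another processor. The sketch below is a simplified illustration only: the 26 msec figure comes from the text, while the function name, the per-message cost model, and the execution times in the usage note are hypothetical.

```python
from typing import Iterable

# One-byte transputer-to-transputer message time quoted in the text (msec).
COMM_TIME_MS = 26.0


def worth_distributing(
    exec_times_ms: Iterable[float],
    messages: int = 1,
    comm_time_ms: float = COMM_TIME_MS,
) -> bool:
    """Return True if a grain does more work than its messages cost.

    Assumed model: a grain is a group of nodes with the given execution
    times, and moving it to another processor costs `messages` messages
    of `comm_time_ms` each.
    """
    return sum(exec_times_ms) > messages * comm_time_ms
```

For example, with hypothetical primitive nodes taking 0.5 and 1.0 msec, `worth_distributing([0.5, 1.0])` is false, which matches the conclusion that fine grain parallelism does not pay off; a coarser, object-level grain with a much larger total execution time can pass the same check.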

8.6.3  Interfacing  With  the  Router 

The modified intermediate form generated above is then passed through the back-end translator to produce INMOS C code. This target code is then interfaced with a router, which removes the restriction on the feasibility of mapping INMOS C processes onto the network of transputers because it allows communication between processors that are not directly connected. To take advantage of the router, the output of the back-end translator must be interfaced with it before it can be executed.
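
To illustrate what it means for a router to connect processors that are not directly linked, the sketch below forwards a message hop by hop over the existing links, using a breadth-first search to choose the next hop. This is a generic store-and-forward scheme written for illustration; it is not the router used in this project, and the function names and the network layout in the usage note are assumptions.

```python
from collections import deque
from typing import Dict, Hashable, Iterable, List


def next_hop_table(links: Dict[Hashable, Iterable[Hashable]],
                   source: Hashable) -> Dict[Hashable, Hashable]:
    """Map each reachable destination to the neighbour of `source` to use.

    `links` maps a processor to the processors it is directly connected to.
    """
    first_hop: Dict[Hashable, Hashable] = {}
    visited = {source}
    queue = deque()
    for neighbour in links[source]:
        if neighbour not in visited:
            visited.add(neighbour)
            first_hop[neighbour] = neighbour
            queue.append(neighbour)
    while queue:
        current = queue.popleft()
        for neighbour in links[current]:
            if neighbour not in visited:
                visited.add(neighbour)
                # Inherit the first hop from the node we arrived through.
                first_hop[neighbour] = first_hop[current]
                queue.append(neighbour)
    return first_hop


def route(links: Dict[Hashable, Iterable[Hashable]],
          source: Hashable, dest: Hashable) -> List[Hashable]:
    """Return the processors a message visits; KeyError if unreachable."""
    path = [source]
    while path[-1] != dest:
        path.append(next_hop_table(links, path[-1])[dest])
    return path
```

With six processors connected in a hypothetical chain, `links = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}`, a message from processor 0 to processor 3 is relayed through processors 1 and 2 even though 0 and 3 share no link.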
Chapter  9 


Conclusions  and  Future  Research 


In this project, an architecture-transparent software development framework has been developed. It is based on the parallel object-oriented functional computation model (PROOF), which incorporates the functional paradigm into the object-oriented paradigm. One advantage of the object-oriented paradigm is that it is a unifying paradigm: its concepts are used from the analysis phase through the coding phase. Also, the object-oriented paradigm reveals the nature of the problem space naturally, and this facilitates the mapping of real world problems onto the parallel processing system. Our approach uses the functional paradigm at the method level, which allows us to exploit massive parallelism. Thus, in our approach coarse grain parallelism can be exploited at the object level, while fine grain parallelism can be exploited at the method level. Since we have separated the architecture-dependent issues from the semantics of the program, the software system generated is much more portable, thereby facilitating the implementation of the PROOF/L code on different target machines and in different target languages.

To establish whether our approach has distinct advantages over other methodologies for developing software for parallel processing systems, we will evaluate its efficiency by comparing it with existing methodologies, such as dataflow-oriented methodologies, based on criteria like code complexity, verifiability, portability, software development effort and availability of software tools. The impact of architecture independence on the efficiency of our approach will also be evaluated. Another direction of research is to identify the class of architectures on which our approach will be very effective. Some of the classes to be studied are shared memory MIMD machines and distributed memory MIMD machines.

In order to make our approach useful, we will also develop PROOF/L to incorporate I/O facilities and data types. The I/O features include file I/O and standard I/O. File I/O includes reading, writing and appending to files. The data types include arrays, characters and floating point. We also plan to develop a translator for translating the IPR to NCC (NCube C), a language supported by the parallel processing system NCube. By doing this, we will not only take advantage of the available software support for NCube, but also be able to evaluate the performance of the software generated by our approach on this machine. Furthermore, CASE tools need to be developed to aid the software developer in various stages of our approach, such as checking consistency in the decomposition stage, the design process, and the transformation of the body design into the corresponding Petri-Net models. Tools are also needed to aid the designer in the partitioning and the grain size analysis phases. We also need to develop a full-fledged compiler for PROOF/L.

The performance of the target code can be improved by developing optimization techniques. We have experimented with our approach on various PROOF/L programs, including the factorial problem, the bounded buffer problem, the dining philosopher problem, the warehouse management problem and the air-base defense simulation problem. We have compared the performance of the INMOS C code generated by our transformation system with the performance of INMOS C code written by hand in terms of the completion time. Because the current transformation system does not include optimization techniques, the performance of the generated code is not as good as that of code implemented directly on the transputer systems. For example, we have extensively tested pipelined versions of the factorial program. The speed-ups of the two factorial programs, the generated code and the code written directly by hand, are almost the same, but the absolute completion time of the generated code is at best twice the completion time of the directly written code. When the programs involve extensive manipulation of list data types, the generated code requires much more time than the hand-written code. We believe that this is due to the inefficiency of the current list handling routines implemented in the back-end transformation. In addition to code optimization techniques, we need to develop techniques that reduce unnecessary data movement during object invocation in order to improve the performance of the generated target code.